As you’ve probably noticed, from about 2030 UTC tonight, OGS has been having some trouble. We’ll update here when we’ve got more information.
Update and postmortem: [Fixed] OGS is having stability issues, stay tuned: 2016-09-27
you mean i’m going to have to actually work at the office?
Good thing my go club meets tonight! =]
“2030 UTC” <<< Oh well, OGS is hosted on quantum computer, and seems like the bug is just that they travelled through time
Hey folks, sorry for the downtime… we’re having an issue with one of our databases and are working to get it back online.
It’s obviously crodgers fault
Alrighty folks, we’re back… sorry about that. Here’s a tl;dr for the postmortem:
The database that we use to manage our live games, reviews, chats, and some other things decided that after 4 years of dedicated service it was going to crash… hard… We weren’t able to recover around 58 reviews, though I’m going to check our backups tomorrow. While the service was degraded we might have also lost some moves. Let us know if you were affected by this and we’ll see what we can do about it.
This database is part of the system that is slated to be replaced within the next few weeks as we move towards a more scalable and fault tolerant system.
Here’s a bit of a longer explanation for those that are interested in some technical details:
The database we use for the things mentioned above is Redis… normally it’s super reliable, but earlier today we noticed the pod wouldn’t stay running for more than a few minutes. Upon closer inspection, it was segfaulting while trying to perform actions against the keys that hold reviews.
Best we can tell, something corrupted our checkpoint file, which caused Redis’s ziplist loader to fail partway through. We’re not sure why this happened; we’ll see if we can figure that part out. We worked to recover the rest of the database but unfortunately lost a few of the reviews mentioned above.
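For anyone curious about the checkpoint file format: an RDB file starts with the ASCII magic `REDIS` followed by a four-digit version number, and full structural verification is what the stock `redis-check-rdb` tool does. Here's a minimal sketch, in Python, of just the magic-header sanity check (function name is mine, not an OGS or Redis API):

```python
# Minimal sanity check of an RDB checkpoint file's magic header.
# RDB files begin with the 5 bytes "REDIS" followed by a 4-digit version
# (e.g. "REDIS0007"). This only catches gross header corruption;
# redis-check-rdb walks the entire file structure.

def has_valid_rdb_header(path: str) -> bool:
    with open(path, "rb") as f:
        header = f.read(9)
    return len(header) == 9 and header[:5] == b"REDIS" and header[5:9].isdigit()
```

A check like this would not have caught the ziplist-level corruption described above, which lived deeper in the file; it's just the first thing to rule out.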
Part of the reason we have been working on our infrastructure so hard lately is to reduce and eliminate single points of failure like this. We know you rely on our service and we’re committed to making it faster, better, and more reliable in every way.
Is this the right place to ask about issues RE the outage? Sorry if it’s not.
I went into vacation mode around 2-3pm Pacific time today in response to the outage. When I came back online just now, I noticed that I am indeed in vacation mode, and my vacation time is decreasing; it’s less than it was earlier today when I turned it on. But the clocks in all my games are also running down: they show less time than when I enabled vacation, and my games don’t say they’re on vacation. So I’m losing vacation time and game time simultaneously.
Edit: I turned off vacation mode so my time would stop draining
Thx. keep up the good work.
Suggestion: well, now I’ve seen the “Site is under maintenance” screen. When I looked this morning it had a pointer to a URL about the outage… about a previous outage. Might want to address that in your copious free time. (:
So, the moral of the story is that NoSQL databases are not @crodgers -compatible? I’ll keep that in mind.
This is *so* off topic, but as someone in the business, I’m curious as to what you’re migrating to?
Another slight update, issue filed against redis which has yet more technical detail: https://github.com/antirez/redis/issues/3527
We’ll still be using redis for a lot of stuff, though we plan on switching over to a redis cluster as opposed to the single node we have now.
For game and review storage, though, we do plan on moving that out of the hybrid redis+postgres system we have now and switching to storing all state in cassandra.
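For those wondering what moving from a single redis node to a cluster changes: Redis Cluster shards keys across 16384 hash slots using CRC16, and keys sharing a `{hash tag}` land on the same slot, which matters when multi-key commands need to stay on one node. A minimal sketch of that slot computation in pure Python (function names and example keys are mine, not OGS code):

```python
# Sketch of Redis Cluster's key-to-slot mapping: CRC16 (XMODEM variant)
# of the key, mod 16384. If the key contains a non-empty {hash tag},
# only the tag's contents are hashed, per the Redis Cluster spec.

def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:  # non-empty tag: hash only its contents
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing a hash tag map to the same slot, so multi-key operations
# on them stay on a single node:
print(key_slot("{game:123}:moves") == key_slot("{game:123}:chat"))  # True
```

Using hash tags like the hypothetical `{game:123}` prefix above is the usual way to keep all of one game's keys co-located once data is sharded.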