As you’ve probably noticed, from about 2030 UTC tonight, OGS has been having some trouble. We’ll update here when we’ve got more information.
Update and postmortem: [Fixed] OGS is having stability issues, stay tuned: 2016-09-27
you mean i’m going to have to actually work at the office?
Good thing my go club meets tonight! =]
“2030 UTC” <<< Oh well, OGS is hosted on quantum computer, and seems like the bug is just that they travelled through time
Hey folks, sorry for the downtime… we’re having an issue with one of our databases and are working to get it back online.
It’s obviously crodgers fault
Alrighty folks, we’re back… sorry about that. Here’s a tl;dr for the postmortem:
The database that we use to manage our live games, reviews, chats, and some other things decided that after 4 years of dedicated service it was going to crash… hard… We weren’t able to recover around 58 reviews, though I’m going to check our backups tomorrow. While the service was degraded we might have also lost some moves. Let us know if you were affected by this and we’ll see what we can do about it.
This database is part of the system that is slated to be replaced within the next few weeks as we move towards a more scalable and fault tolerant system.
Here’s a bit of a longer explanation for those that are interested in some technical details:
The database we use for the things mentioned above is Redis… normally it’s super reliable, but earlier today we noticed the pod wouldn’t stay running for more than a few minutes. Upon closer inspection, it was segfaulting while trying to perform actions against the keys that hold reviews.
Best we can tell, something corrupted our checkpoint file, which caused Redis’s ziplist loader to fail partway through. We’re not sure why this happened; we’ll see if we can figure that part out. We worked to recover the rest of the database but unfortunately lost a few of the reviews mentioned above.
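For anyone curious about the checkpoint file format: an RDB file starts with the ASCII magic `REDIS` followed by a four-digit version number, and full structural verification is what the stock `redis-check-rdb` tool does. Here's a minimal sketch, in Python, of just the magic-header sanity check (function name is mine, not an OGS or Redis API):

```python
# Minimal sanity check of an RDB checkpoint file's magic header.
# RDB files begin with the 5 bytes "REDIS" followed by a 4-digit version
# (e.g. "REDIS0007"). This only catches gross header corruption;
# redis-check-rdb walks the entire file structure.

def has_valid_rdb_header(path: str) -> bool:
    with open(path, "rb") as f:
        header = f.read(9)
    return len(header) == 9 and header[:5] == b"REDIS" and header[5:9].isdigit()
```

A check like this would not have caught the ziplist-level corruption described above, which lived deeper in the file; it's just the first thing to rule out.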
Part of the reason we have been working on our infrastructure so hard lately is to reduce and eliminate single points of failure like this. We know you rely on our service and we’re committed to making it faster, better, and more reliable in every way.
Is this the right place to ask about issues RE the outage? Sorry if it’s not.
I went into vacation mode around 2-3pm Pacific time today in response to the outage. When I came back online just now, I noticed that I am indeed in vacation mode, and my vacation time is decreasing; it’s less than it was earlier today when I turned it on. But the clocks in all my games are also running down: they show less time than when I enabled vacation, and my games don’t say they’re on vacation. So I’m losing vacation time and game time simultaneously.
Edit: I turned off vacation mode so my time would stop draining
Thx. keep up the good work.
Suggestion: well, now I’ve seen the “Site is under maintenance” screen. When I looked this morning it had a pointer to a URL about the outage… about a previous outage. Might want to address that in your copious free time. (:
So, the moral of the story is that NoSQL databases are not @crodgers -compatible? I’ll keep that in mind.
This is *so* off topic, but as someone in the business, I’m curious as to what you’re migrating to?
Another slight update, issue filed against redis which has yet more technical detail: https://github.com/antirez/redis/issues/3527
We’ll still be using redis for a lot of stuff, though we plan on switching over to a redis cluster as opposed to the single node we have now.
For game and review storage, though, we do plan on moving that out of the hybrid redis+postgres system we have now and switching to storing all state in cassandra.
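For those wondering what moving from a single redis node to a cluster changes: Redis Cluster shards keys across 16384 hash slots using CRC16, and keys sharing a `{hash tag}` land on the same slot, which matters when multi-key commands need to stay on one node. A minimal sketch of that slot computation in pure Python (function names and example keys are mine, not OGS code):

```python
# Sketch of Redis Cluster's key-to-slot mapping: CRC16 (XMODEM variant)
# of the key, mod 16384. If the key contains a non-empty {hash tag},
# only the tag's contents are hashed, per the Redis Cluster spec.

def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:  # non-empty tag: hash only its contents
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing a hash tag map to the same slot, so multi-key operations
# on them stay on a single node:
print(key_slot("{game:123}:moves") == key_slot("{game:123}:chat"))  # True
```

Using hash tags like the hypothetical `{game:123}` prefix above is the usual way to keep all of one game's keys co-located once data is sharded.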