July 29 server problems post-mortem

Hi All,

Many of you noticed a lot of strange problems yesterday, so I thought I’d write up what happened for those curious about the technical details. It took me a few hours to get to the root cause of this one.

A few days prior, I updated the interface with a notable refactor of the game page and game logic. As part of this, I inadvertently introduced a change that caused every MiniGoban or GobanLine that was loaded to also join the chat room associated with its game. It takes a few days for everyone’s browsers to update, so initially the extra server load from all of these extra chat channel joins wasn’t noticeable.

Yesterday, however, our chat server (which also handles automatch due to similar requirements) hit critical mass when a “termination server” (part of our fan-out connection handling strategy) reconnected after internal networking issues. The large influx of joins and subsequent updates caused multi-second delays. Those delays in turn generated excessive logging, which further delayed processing and triggered cascading failures as other termination servers reconnected. The situation was further exacerbated by queries from other services backing up and then being retried.
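One common way to keep reconnects and retries from piling up in lockstep like this is exponential backoff with random jitter. The sketch below only illustrates that general technique; it is not taken from the actual server code, and the function name and parameters (`backoff_delays`, `base`, `cap`, `attempts`) are hypothetical.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Return one delay (in seconds) per retry attempt.

    Uses exponential backoff with "full jitter": the ceiling doubles on
    each attempt (up to `cap`), and the actual delay is drawn uniformly
    between 0 and that ceiling so that many clients retrying at once
    spread out instead of hitting the server simultaneously.
    All names and default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn before the next reconnect or retry attempt. The randomness matters as much as the growth: without it, every client that was disconnected at the same moment comes back at the same moment.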

Additionally, because automatch runs in the same process, our comm server would initiate automatch requests but then be unable to process the responses in time, so it would retry them even though some had already executed successfully. While many of our internal APIs are idempotent by design, and all should be, this leg of the game creation process wasn’t, so we were seeing things like multiple games being created, players changing, and all kinds of terrible behavior that shouldn’t have been possible under any condition.
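To make the idempotency point concrete, here is a minimal sketch of a game creation call keyed by a caller-supplied request ID, so that a retry returns the game created by the first attempt instead of creating a second one. This is an assumption about how such a fix could look, not a description of the real implementation; `GameCreator`, `create_game`, and `request_id` are hypothetical names, and an in-memory dict stands in for whatever shared storage a real service would need.

```python
import uuid


class GameCreator:
    """Toy game creation service that is idempotent per request ID."""

    def __init__(self):
        # Maps request_id -> the game created for that request.
        self._games_by_request = {}
        self._next_game_id = 1

    def create_game(self, request_id, black, white):
        # If this request was already handled, return the original result
        # unchanged, even if the retry carries different arguments.
        existing = self._games_by_request.get(request_id)
        if existing is not None:
            return existing
        game = {"game_id": self._next_game_id, "black": black, "white": white}
        self._next_game_id += 1
        self._games_by_request[request_id] = game
        return game


# The caller generates the ID once and reuses it for every retry.
request_id = str(uuid.uuid4())
creator = GameCreator()
first = creator.create_game(request_id, "alice", "bob")
retry = creator.create_game(request_id, "alice", "bob")
```

With this shape, a retry after a slow or lost response is harmless: `retry` is the same game as `first`, and only one game ID has been consumed.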

So sorry for the troubles yesterday, but on the upside it was a good stress test and illuminated several weak points that would have been harder to fix on the fly if we had hit them during more natural growth in players, so I’ll be working on getting those fixed up.

– anoek


Thanks for sharing the postmortem, cascading failures are the worst (to experience, but fun to read about)! I hope you take a well-deserved break to relax a bit after what must have been an intense day of troubleshooting and conjuring up fixes!


Man, I felt so famous when 50 people watched my game... why did you have to tell me :wink:

