As you might have noticed, we had a server outage for about 7 hours last night (or day, depending on where you live!). The root cause was a database server failure, and it lasted so long because it happened just after I went to bed. We are still unclear as to what caused the failure, but we’re continuing to look into it to try to prevent whatever happened from happening again.
The database server is in a manually patched state right now, and as we mentioned, we’ll be working on figuring out exactly what went wrong so we can implement a proper fix. We will likely have scheduled maintenance within the next week or two to roll out that fix for the database system.
I’m also going to be working on implementing better monitoring and alerting so I have a better chance of being woken up when there’s a huge outage like this.
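For anyone curious what that kind of alerting can look like, here is a minimal sketch, not what OGS actually runs. It assumes a hypothetical `probe` callable that returns `True` when the database answers and a hypothetical `alert` callable that pages someone (SMS, phone call, etc.). It only alerts after several consecutive failed checks, so a single blip doesn't wake anybody up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutageMonitor:
    """Pages once after `threshold` consecutive failed health checks."""

    probe: Callable[[], bool]     # hypothetical: True when the service is healthy
    alert: Callable[[str], None]  # hypothetical: sends the page or SMS
    threshold: int = 3            # consecutive failures required before alerting
    failures: int = 0
    alerted: bool = False

    def check(self) -> None:
        # A probe that raises (timeout, refused connection) counts as a failure.
        try:
            healthy = self.probe()
        except Exception:
            healthy = False

        if healthy:
            # Send a single recovery notice if we had previously paged.
            if self.alerted:
                self.alert("recovered")
            self.failures = 0
            self.alerted = False
            return

        self.failures += 1
        # Page only once per outage, when the threshold is first reached.
        if self.failures >= self.threshold and not self.alerted:
            self.alert(f"down: {self.failures} consecutive failed checks")
            self.alerted = True
```

In practice `check()` would be called on a timer (say, once a minute) from a machine other than the database server, so the monitor doesn't go down with the thing it is watching.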
Thanks for the information, anoek, and good luck with finding the cause.
I’m not sure if automatically waking you in case of an outage is a good idea. For most of us, not being able to play go for a couple of hours is not a life-threatening situation after all. (And may I suggest therapy for those for whom it is?)
WARNING: For correspondence games, the BLACK CIRCLE may not indicate all the games awaiting your move. During the outage, I went into ANALYZE mode on some games, but did not move. After the outage, those analyzed games were not included in the BLACK CIRCLE. I then went to HOME, where all games awaiting my move are highlighted in green. This has happened to me before. Once I have made all my moves, the BLACK CIRCLE goes back to being correct.
I don’t favour hiring another dev or admin … OGS is 100% directly crowdfunded now; directly in the sense that when there were ads, we were already supporting OGS indirectly (or is that directly? never mind …)
So I would prefer that the money raised go toward new servers, and not only because many of us, me included, suffer from latency.
It’s because we know OGS can be run on sharded servers now, and sharding is a good practice for improving stability.
Of course, I have no idea what the root cause of this outage was; in other words, I don’t know if sharding would have prevented it. But I’m sure that what we want is better overall stability. We don’t really care how frequently this particular kind of incident might occur in the future; we care about the overall stability of OGS.
Thus, I would prefer that the money raised go toward the architectural changes OGS needs now, not toward hiring another guy (or woman).
However, don’t get me wrong: I also don’t like a team depending on key member(s) … I’m just talking about priorities.
My understanding is that only anoek is a paid dev and that matt is a volunteer, like anoek used to be… Theoretically, our hypothetical third dev could also be a volunteer, thus leaving our hard-donated money to go towards server expansion.
Though maybe the moderation team, whose members are already trusted, could be given access to sysop tools so they can check things if something goes wrong?