Recent server outage

Hi All,

As you might have noticed, we had a server outage for about 7 hours last night (or day, depending on where you live!). The root cause was a database server failure, and it lasted so long because it happened just after I went to bed :frowning:. We are still unclear as to what caused the failure, but we’re continuing to look into it to try and prevent whatever happened from happening again.

The database server is in a manually patched state right now, and as we mentioned we’ll be working on figuring out what exactly went wrong so we can implement a proper fix. We will likely have a scheduled maintenance within the next week or two to roll out a proper fix for the database system.

I’m also going to be working on implementing some better monitoring and alerting functionality so I have a better chance of being woken up when there’s a huge outage like this.

Sorry to all affected :frowning:

– anoek

12 Likes

Thanks for the information, anoek, and good luck with finding the cause.

I’m not sure if automatically waking you in case of an outage is a good idea. For most of us, not being able to play go for a couple of hours is not a life-threatening situation after all. (And may I suggest therapy for those for whom it is? :wink: )

(Edited for typo correction.)

6 Likes

Obviously that server is trying to place you in Damezumari.

Mogadeet

3 Likes

WARNING: For correspondence games, the BLACK CIRCLE may not indicate all the games awaiting your move. During the outage, I went into ANALYZE mode on some games, but did not move. After the outage, those analyzed games were not included in the BLACK CIRCLE. I then went to HOME, where all games awaiting my move are highlighted in green. This has happened to me before. After making all my moves, the BLACK CIRCLE returns to being correct in the future.

1 Like

Hello,

As an ordinary user, in case the voices of ordinary end users count :slight_smile:

I would prefer the OGS team to focus more on improving stability and on post-mortem analysis and tools rather than on changing the alerting system.

I don’t think it’s worth waking anyone up for that, even if it prevents some folks from sleeping :slight_smile:

Arnaud

8 Likes

As an alternative to getting woken up, maybe consider recruiting another dev that lives on the opposite side of the world?

1 Like

puts hand up

Hello,

I don’t favour hiring another dev or admin ( :wink: ) … Since OGS is 100% directly crowdfunded now, directly in the sense that when there were ads, we were already supporting OGS indirectly (or is that directly? never mind …)

So I would prefer the money raised to go toward new servers, and not only because many, me included, suffer from latency.
It’s because we know OGS can be run on sharded servers now and sharding is a good practice for improving stability.
Of course, I have no idea what the root cause of this outage was; in other words, I don’t know if sharding would have prevented it. But I’m sure what we want is better overall stability. We don’t really care how often this particular kind of incident occurs in the future; we do care about the overall stability of OGS.

Thus, I would prefer the money raised to go toward the architectural improvements OGS needs now, not toward hiring another guy (or woman :wink: ).

However, don’t get me wrong: I also don’t like key member(s) in a team … I’m just talking of priorities.

It’s only my opinion I’m sharing here, obviously :slight_smile:

Arnaud

My understanding is that only anoek is a paid dev and that matt is a volunteer like anoek used to be… theoretically our hypothetical third dev could also be a volunteer, thus leaving our hard donated money to go towards server expansion :slight_smile:

\o Ooh, pick me, pick mee.

Though, maybe the moderation team, who are people that are already trusted, could be given access to sysop tools to be able to check stuff if something goes wrong?

Nope nope nope.

@anoek and @matburt are, together, the developers and owners of OGS.

4 Likes

Thank you for the clarification :slight_smile:

1 Like