OGS has a new Glicko-2 based rating system! [2017]

tl;dr We’re switching to a Glicko-2 rating system, which should be great for a lot of reasons. All of your past ranked games have been accounted for during this transition.


Hello OGS!

For the past several years we’ve been using a rating and ranking system based on Elo developed by the European Go Federation. It has served us reasonably well, but there have been some shortcomings when applying that system in an online setting. Most of the shortcomings can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur. Operationally this has been quite troublesome. We have to require new players to provide an initial rank when they sign up, which is confusing for new players and hard to answer correctly for established players, because a rank only means something per server. It also opens the system up to abuse, which moderators have to combat on a daily basis. Moderators also have to manually adjust ranks a lot for legitimate reasons, which requires effort from both players and moderators. If that doesn’t happen, it can take dozens and sometimes hundreds of games for a player to reach their correct rank, and all the while they are negatively affecting other players’ ranks.

Slow-moving ratings are a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems, which address this problem very well and are fairly widely used. As such, we decided to do some testing with Glicko-2 to see if it could be a good fit here at OGS. After testing with ~5M ranked games, there was no question that the Glicko-2 system outperformed the Elo-based system, and it did so without any of the manual intervention or bootstrapping we’ve had to do with the Elo system. That is very encouraging and should solve a lot of problems we’ve had.

The net result of this change for players is that we should be able to do a better job in matching players and computing a better handicap between the players, and your rank should adapt much quicker to your current strength. For the moderation team, it means much less work adjusting ranks and having to deal with sandbaggers or fake dans.

#What you can expect:

  • Your Glicko-2 ratings will be very different numbers than your Elo ratings. They are different systems and thus not really comparable.

  • Your rank may change with this update. Most players’ ranks will stay about the same, plus or minus one rank. However, some players have adjusted significantly. Please give your new ranking a try for a few games before asking for an adjustment. You may find your new ranking is more appropriate. If it’s not, playing a few games should start adjusting you fairly quickly to get you to where you need to be. Moderators cannot adjust your ranks at this time, but if you feel like there is a notable problem with your rank after playing at least 10 games, feel free to contact @anoek.

  • The dan ranks have naturally separated out a bit more than they were with the Elo system.

  • During our analysis we’ve seen that players whose ranks are less than 25k are very difficult to adequately rank, in the sense that handicaps at that level are not a reliable way of accounting for strength differences. As such, we’re now only assigning ranks to players who are strong enough to play with handicaps, which seems to equate to being about 25k and above. The only real effect of this is that when two players whose ranks would be less than 25k play each other, they will play without a handicap.

#Other changes:

  • Games against bots which result in a timeout will not be rated.

  • Correspondence games between two players that have a rank difference of more than 2 and that end in a timeout will no longer be rated. This change is being retracted due to pushback from the community.

  • Handicaps will no longer be enabled by default for any automatch games between players whose rank is less than 25k. When matches are made between players and one player is above the 25k threshold, the handicap will be computed as though the other player was 25k. This is not retroactive.

  • Handicaps will not be enabled by default in automatch games for players with a high “deviation”, i.e., a lack of confidence about their rank. As their rank adjusts and their deviation drops, handicaps will be enabled gradually. This will primarily affect games involving new players as we find an appropriate rank for them.

  • Games will no longer be annulled after both players have made a move. This is a change from our previous behavior where cancellation and annulment would take place if the game ended prior to move 19 in a 19x19 game. This is not retroactive.
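
For anyone who finds the two handicap rules above easier to read as code, here is a minimal Python sketch. It is not the OGS implementation. The numeric rank encoding (30k = 0, 25k = 5, 1d = 30), the idea that the handicap equals the rank difference, the function name `default_handicap`, and the `MAX_DEVIATION` cutoff are all assumptions made for illustration. The post only says handicaps are enabled "gradually" as deviation drops, so the hard cutoff here is a simplification.

```python
# Illustrative sketch only; names and numbers below are assumptions.
MIN_RANKED = 5          # 25k, on an assumed scale where 30k = 0 and 1d = 30
MAX_DEVIATION = 160.0   # hypothetical cutoff; the post gives no number


def default_handicap(rank_a: int, rank_b: int, dev_a: float, dev_b: float) -> int:
    """Return the default automatch handicap for two players."""
    # Both players below the 25k threshold: no handicap.
    if rank_a < MIN_RANKED and rank_b < MIN_RANKED:
        return 0
    # Too little confidence in either rank: no handicap (simplified to a hard cutoff).
    if max(dev_a, dev_b) > MAX_DEVIATION:
        return 0
    # A player below the threshold is treated as though they were 25k.
    clamped_a = max(rank_a, MIN_RANKED)
    clamped_b = max(rank_b, MIN_RANKED)
    # Assumption: handicap equals the difference in ranks.
    return abs(clamped_a - clamped_b)
```

With this encoding, a 28k against a 20k is handled as 25k against 20k, so `default_handicap(2, 10, 60.0, 60.0)` returns 5.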

#Glicko-2 implementation notes for the curious:

The Glicko-2 system has several parameters that can be adjusted: the initial rating, the initial deviation, the initial volatility, the number of games considered at one time, and the frequency of deviation updates due to player inactivity. The defaults suggested by Glicko-2 are an initial rating of 1500, an initial deviation of 350, an initial volatility of 0.06, and processing games in batches of around 10-15.

We explored a lot of different values trying to optimize the system for our purposes, and especially when group sizes of 1 were used, there were several different choices for initial deviation that produced marginally better results. However, when using group sizes of 10-15, just as Prof. Glickman recommends, the other recommended settings were pretty much optimal. As such, our implementation just uses the defaults as laid out by Prof. Glickman, because they’re great.
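
For the curious, a single Glicko-2 update over one batch of games can be sketched in Python as below. This follows the steps in Prof. Glickman's published description of Glicko-2 and uses the defaults quoted above. It is not the OGS code. The names `Rating` and `update` are made up for this sketch, and the system constant `tau = 0.5` is an assumption, since this post does not say which value OGS uses.

```python
import math
from dataclasses import dataclass

SCALE = 173.7178  # converts between the Glicko and Glicko-2 scales


@dataclass
class Rating:
    rating: float = 1500.0      # default initial rating
    deviation: float = 350.0    # default initial deviation
    volatility: float = 0.06    # default initial volatility


def _g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def update(player, results, tau=0.5, eps=1e-6):
    """Apply one batch. results is a list of (opponent Rating, score),
    where score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    mu = (player.rating - 1500.0) / SCALE
    phi = player.deviation / SCALE
    sigma = player.volatility
    if not results:
        # No games in the period: only the deviation grows.
        return Rating(player.rating, math.sqrt(phi * phi + sigma * sigma) * SCALE, sigma)

    v_inv = 0.0
    score_sum = 0.0
    for opp, score in results:
        mu_j = (opp.rating - 1500.0) / SCALE
        g = _g(opp.deviation / SCALE)
        e = 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))  # expected score
        v_inv += g * g * e * (1.0 - e)
        score_sum += g * (score - e)
    v = 1.0 / v_inv
    delta = v * score_sum

    # Solve for the new volatility with the Illinois variant of regula falsi.
    a = math.log(sigma * sigma)

    def f(x):
        ex = math.exp(x)
        top = ex * (delta * delta - phi * phi - v - ex)
        bottom = 2.0 * (phi * phi + v + ex) ** 2
        return top / bottom - (x - a) / (tau * tau)

    lo = a
    if delta * delta > phi * phi + v:
        hi = math.log(delta * delta - phi * phi - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        hi = a - k * tau
    f_lo, f_hi = f(lo), f(hi)
    while abs(hi - lo) > eps:
        mid = lo + (lo - hi) * f_lo / (f_hi - f_lo)
        f_mid = f(mid)
        if f_mid * f_hi <= 0:
            lo, f_lo = hi, f_hi
        else:
            f_lo /= 2.0
        hi, f_hi = mid, f_mid
    new_sigma = math.exp(lo / 2.0)

    phi_star = math.sqrt(phi * phi + new_sigma * new_sigma)
    new_phi = 1.0 / math.sqrt(1.0 / (phi_star * phi_star) + 1.0 / v)
    new_mu = mu + new_phi * new_phi * score_sum
    return Rating(new_mu * SCALE + 1500.0, new_phi * SCALE, new_sigma)
```

Feeding in the worked example from Prof. Glickman's Glicko-2 description (a player at 1500 with deviation 200 who beats a 1400 player and loses to a 1550 and a 1700 player) should give a new rating of about 1464 with a deviation of about 151.5.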

Grouping as envisioned by Prof. Glickman essentially works by considering many games at once, which for in-person tournaments is not a problem. For online play where we have a lot of ad-hoc games, we needed to get a little creative, so our solution is as follows: We maintain a running tally of up to 15 games recently played, along with a “base” Glicko-2 rating and a “current” rating. Your “current” rating is the rating that everyone sees and is used when match making and computing handicaps. When you finish a game, we add the game to the list of games, and compute your new “current” rating by applying the results of those games to your “base” rating. When recomputing, we also use your opponents’ “current” ratings as opposed to simply the ratings they had when the game originally concluded. Using this technique consistently produced better results in our experiments, and we later found that Prof. Glickman applied some techniques akin to this in his Glicko-Boost work in 2010, which made us feel better about doing this as opposed to simply using the original ratings. Once 15 games have been played, the “current” rating is finalized and the list is reset. We’ll also finalize the “current” rating after 30 days, regardless of how many games have been played.
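
The bookkeeping described above can be sketched roughly as follows. This is a simplified Python illustration, not the actual OGS code. The names `Player`, `record_game`, `finalize`, and `finalize_if_stale` are hypothetical, the rating math is passed in as an `update` function so the sketch stays independent of any particular Glicko-2 implementation, and measuring the 30 days from the first game in the window is an assumption, since the post does not say when that clock starts.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any, Callable, List, Optional, Tuple

MAX_GAMES = 15                 # window size described in the post
MAX_AGE = timedelta(days=30)   # finalize after this long regardless of game count

# update(base_rating, [(opponent_current_rating, score), ...]) -> new rating
UpdateFn = Callable[[Any, List[Tuple[Any, float]]], Any]


@dataclass(eq=False)
class Player:
    base: Any                                  # rating at the start of the window
    current: Any = None                        # rating everyone sees
    games: List[Tuple["Player", float]] = field(default_factory=list)
    window_start: Optional[datetime] = None    # assumed: time of first game in window

    def __post_init__(self):
        if self.current is None:
            self.current = self.base


def finalize(player: Player) -> None:
    """Promote the current rating to the new base and reset the list."""
    player.base = player.current
    player.games.clear()
    player.window_start = None


def record_game(player: Player, opponent: Player, score: float,
                now: datetime, update: UpdateFn) -> None:
    """Add a finished game and recompute the current rating from the base."""
    if player.window_start is None:
        player.window_start = now
    player.games.append((opponent, score))
    # Use each opponent's *current* rating, not the rating they had at game end.
    batch = [(opp.current, s) for opp, s in player.games]
    player.current = update(player.base, batch)
    if len(player.games) >= MAX_GAMES:
        finalize(player)


def finalize_if_stale(player: Player, now: datetime) -> None:
    """Finalize a window that has been open for 30 days or more."""
    if player.window_start is not None and now - player.window_start >= MAX_AGE:
        finalize(player)
```

Note that each call to `record_game` recomputes from the "base" rating over the whole list, rather than nudging the "current" rating by one game, which is what lets later changes in an opponent's rating flow back into earlier results.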

Great writeup, thank you! Looking forward to seeing how this plays out for all of us. :o)

Totally in favor of trying a new rating system.

you didn’t say when this cuts in?

(oh - right now)

It’s live now! Technically speaking, both rating systems are still active, but the new system is the one actively being used. If something goes horribly wrong with it, we can revert to the old one.

Love the size/speed grid.

Which ranking does Automatch use now?

The “Overall” rating is used for everything, the rest are just informational.

Hi everyone, I made a poll to see how much ranks changed compared to the old system: http://poal.me/z4y46p

Of course, if the developers already have these statistics, it would be interesting to hear them!

Average was +2 ranks

Correspondence games between two players that have a rank difference of more than 2 and that end in a timeout will no longer be rated.

What about games finished before the switch?

Also, this seems like a source of abuse of the system, since people can deliberately time out lost games if the rank difference is large enough.

I like the breakdown by speed and board size, but there doesn’t seem to be any relationship between the ratings? My overall rating is higher than any of the specific ratings, and my highest specific rating is in blitz 13x13 which I barely play and don’t really recall being any good at. I guess it’s just having a hard time giving a rating there because there are so few games? But it feels like all of these are on different scales, which makes it hard to interpret them.

I have a friend in my list who has played almost 300 ranked games and his rating is listed as N. I’m wondering how that could be as I assume that means there isn’t enough data to produce a rating for that person yet?

This was a “bug” in a sense: I was using old information for a cutoff value. If you refresh and view their profile again, it should be fixed.

And so it is. Thanks for clearing that up.

This is true, they are on different scales. This is also the reason why there are no ranks associated with the broken out ratings. To be honest it’s an open question as to whether we should keep the breakouts or not, at this point they are strictly informational.

I’m all for strictly informational, I’m just not sure what information I’m supposed to get out of them.

Edit: I guess if, say, you’re getting stronger at correspondence but stagnating in blitz it’ll show you that. It would help a lot if you could assign even some informal kyu ranks to those for comparison purposes, though.

I think at this point they are just adding to the confusion. I suspect they will be removed but that’s just my speculation.

This was applied retroactively as it resulted in a notable improvement in the performance of the rating system.

Abuse is certainly a possibility and it’s something I’m monitoring, though in practice I don’t think it’s as big of a risk as one might think. But we’ll find out :)

If I create a game with a rank restriction of 10k - 9k, what exact range of Glicko ratings does an opponent need to be in to accept it?

Hi,

I found one user ranked 10d. Is that expected?

Arnaud
