tl;dr We’re switching to a Glicko-2 rating system, which should be great for a lot of reasons. All of your past ranked games have been accounted for during this transition.
For the past several years we’ve been using a rating and ranking system developed by the European Go Federation and based on Elo. It has served us reasonably well, but it has shown some shortcomings in an online setting. Most of them can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur. Operationally this has been quite troublesome. We have to require new players to provide an initial rank when they sign up, which is confusing for new players and hard to answer correctly for established players, because a rank only means something per server. Self-reported ranks also open the system up to abuse, which moderators have to combat on a daily basis. Moderators also have to manually adjust ranks a lot for legitimate reasons, which requires effort from both players and moderators. If that doesn’t happen, it can take dozens and sometimes hundreds of games for a player to reach their correct rank, and all the while that player is negatively affecting other players’ ranks.
Slow-moving ratings are a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems, which address this problem very well and are fairly widely used. As such, we decided to do some testing with Glicko-2 to see if it could be a good fit here at OGS. After testing with ~5M ranked games, there was no question that the Glicko-2 system outperformed the Elo-based system, and did so without any of the manual intervention or bootstrapping we’ve had to do with the Elo system. That is very encouraging and should solve a lot of the problems we’ve had.
The net result of this change for players is that we should be able to do a better job in matching players and computing a better handicap between the players, and your rank should adapt much quicker to your current strength. For the moderation team, it means much less work adjusting ranks and having to deal with sandbaggers or fake dans.
# What you can expect:
Your Glicko-2 ratings will be very different numbers than your Elo ratings. They are different systems and thus not really comparable.
Your rank may change with this update. Most players’ ranks will stay about the same, plus or minus one rank. However, some players’ ranks have been adjusted significantly. Please give your new ranking a try for a few games before asking for an adjustment. You may find your new ranking is more appropriate. If it’s not, playing a few games should start moving you fairly quickly toward where you need to be. Moderators cannot adjust your ranks at this time, but if you feel like there is a notable problem with your rank after playing at least 10 games, feel free to contact @anoek.
The dan ranks have naturally separated out a bit more than they were with the Elo system.
During our analysis we’ve seen that players weaker than 25k are very difficult to rank adequately, in the sense that handicaps at that level are not a reliable way of accounting for strength differences. As such, we’re now only assigning ranks to players who are strong enough to play with handicaps, which seems to equate to being about 25k or stronger. The only real effect of this is that when two players whose ranks would be weaker than 25k play each other, they will play without a handicap.
Games against bots which result in a timeout will not be rated.
Correspondence games between two players with a rank difference of more than 2 that end in a timeout will no longer be rated. This change is being retracted due to pushback from the community.
Handicaps will no longer be enabled by default for any automatch games between players whose ranks are both weaker than 25k. When a match is made between two players and only one of them is above the 25k threshold, the handicap will be computed as though the other player were 25k. This is not retroactive.
Handicaps will not be enabled by default in automatch games for players with a high “deviation”, i.e., a lack of confidence about their rank. As their rank adjusts and their deviation drops, handicaps will be enabled gradually. This will primarily affect games involving new players as we find an appropriate rank for them.
Games will no longer be annulled after both players have made a move. This is a change from our previous behavior where cancellation and annulment would take place if the game ended prior to move 19 in a 19x19 game. This is not retroactive.
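To make the 25k handicap rule from the list above concrete, here is a minimal sketch. It rests on assumptions that are not in this post: ranks are plain integers where larger means stronger (30k = 0, 25k = 5, 1k = 29, 1d = 30), the handicap is one stone per rank of difference, and `automatch_handicap` is a hypothetical name. It also ignores the deviation-based gating described above.

```python
# Assumed rank scale (not from the post): 30k = 0, 25k = 5, 1k = 29, 1d = 30.
RANK_FLOOR = 5  # 25k, the weakest rank that is still assigned


def automatch_handicap(rank_a: int, rank_b: int) -> int:
    """Return the default handicap in stones (assumed: one stone per rank)."""
    # Two players who are both weaker than 25k play without a handicap.
    if rank_a < RANK_FLOOR and rank_b < RANK_FLOOR:
        return 0
    # Otherwise, treat any player weaker than 25k as though they were 25k.
    return abs(max(rank_a, RANK_FLOOR) - max(rank_b, RANK_FLOOR))
```

For example, under these assumptions a 20k player (rank 10) matched against a 27k player (rank 3) would get the same handicap as against a 25k player.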
The Glicko-2 system has several parameters that can be adjusted: the initial rating, initial deviation, initial volatility, the number of games considered at one time, and frequency of deviation updates due to player inactivity. The suggested defaults by Glicko-2 are an initial rating of 1500, an initial deviation of 350, an initial volatility of 0.06, and to process games in batches of around 10-15.
We explored a lot of different values trying to optimize the system for our purposes, and especially when group sizes of 1 were used, there were several different choices for initial deviation that produced marginally better results. However, when using group sizes of 10-15, just as Prof. Glickman recommends, the other recommended settings were pretty much optimal. As such, our implementation just uses the defaults as laid out by Prof. Glickman, because they’re great.
Grouping as envisioned by Prof. Glickman essentially works by considering many games at once, which for in-person tournaments is not a problem. For online play where we have a lot of ad-hoc games, we needed to get a little creative, so our solution is as follows: We maintain a running list of up to 15 recently played games, along with a “base” Glicko-2 rating and a “current” rating. Your “current” rating is the rating that everyone sees and is the one used for matchmaking and computing handicaps. When you finish a game, we add the game to the list, and compute your new “current” rating by applying the results of all games in the list to your “base” rating. When recomputing, we also use your opponents’ “current” ratings, as opposed to the ratings they had when the game originally concluded. Using this technique consistently produced better results in our experiments, and we later found that Prof. Glickman applied some techniques akin to this in his Glicko-Boost work in 2010, which made us feel better about doing this as opposed to simply using the original ratings. Once 15 games have been played, the “current” rating is finalized (it becomes the new “base” rating) and the list is reset. We’ll also finalize the “current” rating after 30 days, regardless of how many games have been played.
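As a rough illustration of the “base” / “current” scheme described above, here is a sketch in Python. It is not our production code. The update function follows the standard Glicko-2 steps from Prof. Glickman’s description; the system constant `TAU = 0.5` is an assumed value that this post does not specify, and the names `Rating`, `Player`, `glicko2_update`, and `finish_game` are hypothetical. The 30-day finalization is left to a scheduler that is not shown.

```python
import math
from dataclasses import dataclass, field

SCALE = 173.7178   # Glicko-2 scale factor
TAU = 0.5          # system constant; assumed value, not stated in the post
WINDOW_SIZE = 15   # games per window, as described above


@dataclass(frozen=True)
class Rating:
    r: float = 1500.0   # initial rating
    rd: float = 350.0   # initial deviation
    vol: float = 0.06   # initial volatility


def glicko2_update(base, results, tau=TAU, eps=1e-6):
    """Apply one rating period. results: list of (opponent Rating, score)."""
    if not results:
        return base  # sketch only: the inactivity deviation increase is omitted
    mu, phi = (base.r - 1500.0) / SCALE, base.rd / SCALE
    v_inv = total = 0.0
    for opp, score in results:
        mu_j, phi_j = (opp.r - 1500.0) / SCALE, opp.rd / SCALE
        g = 1.0 / math.sqrt(1.0 + 3.0 * phi_j ** 2 / math.pi ** 2)
        e = 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))  # expected score
        v_inv += g * g * e * (1.0 - e)
        total += g * (score - e)
    v = 1.0 / v_inv
    delta = v * total

    # Solve for the new volatility with the iterative procedure from the paper.
    a = math.log(base.vol ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    vol = math.exp(A / 2.0)

    phi_star = math.sqrt(phi ** 2 + vol ** 2)
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * total
    return Rating(SCALE * mu_new + 1500.0, SCALE * phi_new, vol)


@dataclass(eq=False)
class Player:
    base: Rating = field(default_factory=Rating)
    current: Rating = field(default_factory=Rating)
    window: list = field(default_factory=list)  # (opponent Player, score) pairs

    def recompute(self):
        # Re-apply every game in the window to the base rating, using each
        # opponent's *current* rating rather than their rating at game time.
        return glicko2_update(self.base, [(o.current, s) for o, s in self.window])

    def finalize(self):
        # Run after 15 games; a scheduler (not shown) would also run it after 30 days.
        self.base = self.current
        self.window.clear()


def finish_game(a, b, a_score):
    """Record a result; a_score is 1.0 if a won and 0.0 if a lost."""
    a.window.append((b, a_score))
    b.window.append((a, 1.0 - a_score))
    new_a, new_b = a.recompute(), b.recompute()  # both use pre-game ratings
    a.current, b.current = new_a, new_b
    for p in (a, b):
        if len(p.window) >= WINDOW_SIZE:
            p.finalize()
```

Calling `finish_game(alice, bob, 1.0)` on two fresh `Player()` objects moves both “current” ratings while leaving their “base” ratings at the defaults until the window fills. Computing both new ratings before assigning either is a choice made for this sketch, so that neither player’s update sees the other’s post-game rating for the game just played.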