Recently, the revision of the ratings caused a friend and me to drift so far apart that we are no longer allowed to play OGS games against each other. In my experience, we are not that far apart in strength, but he seems to have gotten stuck. This caused me to question the efficiency of the rating system employed by OGS, and it brought back memories of an earlier life in which I was for many years an active member of the Ranking Review Committee for a worldwide sport. That experience taught me a thing or two about rating systems. Allow me to share a few thoughts.
A system like Elo at least gets off to a good start in that the win probability of player X over player Y can be derived from the rating difference Rating(X) – Rating(Y) via the Classical Win Probability formula. To the uninitiated observer, this may seem like a luxury, but it has far-reaching practical consequences for the design and comparison of rating systems. To begin with, for every game g in a given batch of game results, the quantities
OWhr(g) = Observed Win of the higher rated player in game g and
EWhr(g) = Expected Win of the higher rated player in game g
are given by
OWhr(g) = 1 if the higher rated player wins
OWhr(g) = 0 if the higher rated player loses
EWhr(g) = win probability of the higher rated player
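A minimal sketch of these two quantities in Python. It assumes that the Classical Win Probability formula is the usual logistic Elo form with a 400-point scale; the function names and the `scale` parameter are illustrative only and are not taken from any OGS code.

```python
def elo_win_probability(rating_x: float, rating_y: float, scale: float = 400.0) -> float:
    """Win probability of X over Y from the rating difference.

    Assumption: logistic Elo form with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_y - rating_x) / scale))


def higher_rated_quantities(rating_a: float, rating_b: float, a_won: bool):
    """Return (OWhr, EWhr) for one game between players A and B."""
    if rating_a >= rating_b:
        higher, lower, higher_won = rating_a, rating_b, a_won
    else:
        higher, lower, higher_won = rating_b, rating_a, not a_won
    observed = 1 if higher_won else 0               # OWhr(g)
    expected = elo_win_probability(higher, lower)   # EWhr(g)
    return observed, expected
```

For example, `higher_rated_quantities(1500, 1900, True)` describes an upset: the higher rated player B lost, so OWhr is 0 while EWhr is about 0.91.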
Then the quantity
Sum {OWhr(g) – EWhr(g)}
where the summation is over all games g in the batch, measures how far the actual results of the higher rated players deviate from what the ratings predicted. If we compute this sum for Elo and Glicko over the same batch of games, we already have an objective comparison of the relative efficiency of these two systems, albeit a rather crude one. However, it can be refined. We can regard Sum {OWhr(g)} as a random variable. Assuming the games are independent, it has
V = Sum {EWhr(g) * (1 – EWhr(g))}
as variance. Then the standardized random variable
Z = Sum {OWhr(g) – EWhr(g)} / sqrt(V)
is more useful because it can be compared to that of another system even if the batch sizes are not equal. However, it is still not satisfactory, because the game result is more readily predictable when EWhr(g) is well above 50% than when it is near 50%, and yet all games were given equal weight. To overcome this weakness, we divide the probability interval from 50% to 100% into (say) 20 subintervals
I_1 = 50% to 52.5%, I_2 = 52.5% to 55%, …, I_20 = 97.5% to 100%
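and compute Z_k as above for every I_k, using only the games whose EWhr(g) falls in I_k. This paves the way for the computation of the widely used Chi-squared statistic, for example as the sum of the squares Z_k^2 over all subintervals. It is a useful tool for comparing one rating system with another as regards rating accuracy.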
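The binning procedure can be sketched roughly as follows. This assumes the Chi-squared statistic is taken as the sum of the squared Z_k over the non-empty subintervals, and that the games are independent; the function name and the input format (a list of (EWhr, OWhr) pairs) are made up for illustration.

```python
import math


def binned_chi_squared(games, n_bins=20):
    """games: iterable of (EWhr, OWhr) pairs, EWhr in [0.5, 1.0], OWhr in {0, 1}.

    Returns (chi_squared, z_by_bin), where z_by_bin maps the 0-based index
    of each non-empty subinterval to its Z_k.
    """
    width = 0.5 / n_bins
    diff = [0.0] * n_bins  # Sum {OWhr - EWhr} per subinterval
    var = [0.0] * n_bins   # Sum {EWhr * (1 - EWhr)} per subinterval
    for expected, observed in games:
        # The small epsilon guards against floating-point error at bin edges;
        # min() puts EWhr == 1.0 into the last subinterval.
        k = min(int((expected - 0.5) / width + 1e-9), n_bins - 1)
        diff[k] += observed - expected
        var[k] += expected * (1.0 - expected)
    z_by_bin = {
        k: diff[k] / math.sqrt(var[k])
        for k in range(n_bins)
        if var[k] > 0.0  # skip empty subintervals
    }
    chi_squared = sum(z * z for z in z_by_bin.values())
    return chi_squared, z_by_bin
```

If the win probabilities were exactly right, each Z_k would be roughly standard normal for a large batch, so the result could be judged against a Chi-squared distribution with as many degrees of freedom as there are non-empty subintervals. Running the same batch through two rating systems and comparing the two values is the comparison described above.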
By using this statistical tool, one can optimize the Elo parameters for Go. There is no reason to believe that the parameter chosen originally by Arpad Elo for chess is optimal for Go. One could similarly optimize the parameters for Glicko or any other system and then choose the one that works best for Go according to the Chi-squared statistic. There are several excellent known rating systems besides Elo and Glicko. Why not use the best of them for Go?
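To make the optimization idea concrete, here is a rough sketch of a grid search. Everything specific in it is an assumption for illustration: it tunes the K-factor of a plain Elo update (the post does not say which parameter is meant), it starts every player at 1500, and it takes the games as a chronological list of (winner, loser) identifiers.

```python
def expected(r_x, r_y, scale=400.0):
    """Assumed logistic Elo win probability of X over Y."""
    return 1.0 / (1.0 + 10.0 ** ((r_y - r_x) / scale))


def chi_squared_for_k(games, k_factor, n_bins=20, start=1500.0):
    """Replay games in order with a plain Elo update and return the binned
    Chi-squared of the pre-game predictions for this K-factor."""
    ratings = {}
    width = 0.5 / n_bins
    diff = [0.0] * n_bins
    var = [0.0] * n_bins
    for winner, loser in games:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        e_w = expected(r_w, r_l)  # pre-game win probability of the winner
        if r_w != r_l:  # with equal ratings there is no higher rated player
            # View the prediction from the higher rated player's side.
            e_hr, o_hr = (e_w, 1) if r_w > r_l else (1.0 - e_w, 0)
            b = min(int((e_hr - 0.5) / width + 1e-9), n_bins - 1)
            diff[b] += o_hr - e_hr
            var[b] += e_hr * (1.0 - e_hr)
        # Plain Elo update after the game.
        ratings[winner] = r_w + k_factor * (1.0 - e_w)
        ratings[loser] = r_l - k_factor * (1.0 - e_w)
    return sum(d * d / v for d, v in zip(diff, var) if v > 0.0)


def best_k(games, candidates):
    """Return the candidate K-factor with the smallest Chi-squared."""
    return min(candidates, key=lambda k: chi_squared_for_k(games, k))
```

A call such as `best_k(games, [8.0, 16.0, 24.0, 32.0])` picks the best of the listed values on a given game history. The candidates should be restricted to sensible values: a setting that never moves the ratings apart scores no games at all and would look deceptively good. The same loop could wrap Glicko or any other system, as long as it yields a pre-game win probability for each game.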