Calculable efficiency of rating systems

Recently, the revision of the ratings caused a friend and me to drift apart so far that we are no longer allowed to play OGS games against each other. In my experience, we are not that far apart, but he seems to have gotten stuck. This caused me to question the efficiency of the rating system employed by OGS. And it brought back memories of an earlier life in which I was for many years an active member of the Ranking Review Committee for a worldwide sport. That experience taught me a thing or two about rating systems. Allow me to share a few thoughts.

A system like Elo at least gets off to a good start in that the win probability of player X over player Y can be derived from the rating difference Rating(X) – Rating(Y) via the Classical Win Probability formula. To the uninitiated observer, this may seem like a luxury, but it has far-reaching practical consequences for the design and comparison of rating systems. To begin with, for every game g in a given batch of game results the quantities

OWhr(g) = Observed Win of the higher rated player in game g and
EWhr(g) = Expected Win of the higher rated player in game g

are given by

OWhr(g) = 1 if the higher rated player wins
OWhr(g) = 0 if the higher rated player loses
EWhr(g) = win probability of the higher rated player
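These definitions can be sketched in code. The logistic formula with a 400-point scale below is the classical Elo curve and is an assumption on my part (OGS's actual curve differs); the function names are illustrative:

```python
import math

def expected_win_higher(r_high, r_low):
    """EWhr(g): classical Elo win probability of the higher-rated player.

    Uses the standard logistic curve with a 400-point scale; other
    systems use other scales, so treat this as illustrative only.
    """
    diff = r_high - r_low  # non-negative by construction
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def observed_win_higher(winner_rating, loser_rating):
    """OWhr(g): 1 if the higher-rated player won the game, else 0."""
    return 1 if winner_rating >= loser_rating else 0

# Example: a player rated 1800 beats a player rated 1600.
print(expected_win_higher(1800, 1600))   # about 0.76
print(observed_win_higher(1800, 1600))   # 1
```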

Then consider the quantity

Sum {OWhr(g) – EWhr(g)}

where the summation is over all games g in the batch. If we compute this sum for Elo and Glicko over the same batch of games, we already have an objective comparison of the relative efficiency of these two systems, albeit a rather crude one. However, it can be refined. We can regard Sum{OWhr(g)} as a random variable. It has

V = Sum{EWhr(g) * (1 - EWhr(g))}

as variance. Then the standardized random variable

Z = Sum {OWhr(g) – EWhr(g)} / sqrt(V)

is more useful because it can be compared with that of another system even when the batch sizes are not equal. However, it is still not satisfactory, because the game result is more readily predictable when EWhr(g) is well above 50% than when it is near 50%, and yet both cases are given equal weight. To overcome this weakness, we divide the probability interval from 50% to 100% into (say) 20 subintervals

I_1 = 50 to 52.5, I_2 = 52.5 to 55, … etc

and compute Z_k as above for every I_k. This paves the way for the computation of the widely used Chi-squared statistic. It is a useful tool for comparing one rating system with another as regards rating accuracy.
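As a sketch of the procedure described above (assuming each game record carries the OWhr and EWhr values already defined; the function name is my own invention):

```python
import math

def binned_chi_squared(games, n_bins=20):
    """Compute Z_k per probability bin and the resulting chi-squared statistic.

    `games` is a list of (owhr, ewhr) pairs: the observed result (0 or 1)
    and the expected win probability of the higher-rated player.
    The interval [0.5, 1.0] is split into `n_bins` equal subintervals.
    """
    width = 0.5 / n_bins
    sums = [0.0] * n_bins       # per-bin sum of OWhr(g) - EWhr(g)
    variances = [0.0] * n_bins  # per-bin sum of EWhr(g) * (1 - EWhr(g))
    for owhr, ewhr in games:
        k = min(int((ewhr - 0.5) / width), n_bins - 1)
        sums[k] += owhr - ewhr
        variances[k] += ewhr * (1.0 - ewhr)
    # Standardize each non-empty bin, then sum the squares.
    z = [s / math.sqrt(v) for s, v in zip(sums, variances) if v > 0]
    chi2 = sum(zk * zk for zk in z)
    return z, chi2
```

One would run this over the same batch of games for each candidate rating system and compare the chi-squared values, with degrees of freedom equal to the number of non-empty bins.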

By using this statistical tool, one can optimise the Elo parameters for Go. There is no reason to believe that the parameters chosen originally by Arpad Elo for chess are optimal for Go. One could similarly optimize the parameters of Glicko or any other system and then choose the one that works best for Go according to the chi-squared statistic. There are several excellent known rating systems besides Elo and Glicko. Why not use the best of them for Go?


Here are some work-arounds:

  1. You can still play unranked games with your friend.
  2. If you want to play ranked games with them, you can challenge them on a ladder (even a private one in a private group is an option).
  3. Joining the same tournaments is another way to play ranked games together.

I don’t know enough about rating systems to comment on the rest, but it sure sounds like you know what you are talking about.


I don’t really see a relation between your issue of playing rated games with your friend and your proposed rating system. Do you think your issue is caused by inaccuracy of OGS’ rating system, which would be fixed by your proposal?


So, I kind of see what you’re trying to do, but wouldn’t a more standard measure work here?

Say L is the expected win rate of “Left”, R is the expected win rate of “Right” (such that L + R = 1), and s is the score of the game (1 for a Left win, 0 for a Right win). Then the amount of “information” that comes out of the game (in nats) is the surprisal -(ln(L) * s + ln(R) * (1 - s)). (You can change the base of the logarithm to whatever units make sense; log base 2 gives “bits”.) Sum this over all data points, then select parameters to minimize the sum: information is a measure of “surprisal”, so you would want the system to be minimally surprised by the data set. To compare its efficiency across data sets of different sizes, divide by the total number of data points for the “average information of each game”.
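A minimal sketch of that measure (taking the negative of the log term so the surprisal is positive and lower is better; the function name is illustrative):

```python
import math

def average_surprisal(games):
    """Mean information (in nats) revealed per game.

    `games` is a list of (l_prob, s) pairs: the predicted win probability
    of "Left" and the actual score s (1 for a Left win, 0 for a Right win).
    Lower is better: a well-calibrated system is less surprised on average.
    """
    total = 0.0
    for l_prob, s in games:
        r_prob = 1.0 - l_prob
        # Surprisal is the negative log-probability of what actually happened.
        total += -(math.log(l_prob) * s + math.log(r_prob) * (1 - s))
    return total / len(games)

# Example: two confident correct predictions and one upset.
print(average_surprisal([(0.9, 1), (0.8, 1), (0.7, 0)]))  # about 0.51 nats
```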


You can share your thoughts in code:

I’d think that as a metric it would be natural to use binary cross-entropy loss. It is used in evaluating binary classifiers, and a rating system is a kind of binary classifier. We have the predicted probability for the game result p and the real result y; then the loss function (from the closest Google result) is

loss = -(y * ln(p) + (1 - y) * ln(1 - p))


Or some other classification function. After all, smart people use them.

Jokes aside, maybe your friend just needs to get good. It’s always the rating system’s fault, isn’t it?


Because, as you have surely seen in the last year, making changes to the ranking system of an active site is a massive and disruptive effort. Changing even a single site is a major undertaking. If you want to standardise for “GO”, it’s an almost inconceivably large task.

There will always be another, better tool arriving just after you implement something, just as there’s always a better PC or car just after you buy one.

The question is not “why not use the best?”. The question is “would any change right now be worthwhile - what tangible benefit would come from changing now?”.

Your post doesn’t touch on the question of “what tangible change would be experienced by optimising in the way you suggested?”


No, I don’t expect that an improved rating system would necessarily diminish the rating gap between me and my friend. It was just what caused me to think about the efficiency of the system being used.
What I suggested is worthy of consideration regardless of the immediate motive for bringing it up.

Thank you, these are useful suggestions.
Will follow up on them.



It was a massive effort for the committee to come up with the tools I outlined. Now that they are known, the task comes down to writing the code for implementation. That is not massive.

Giving every player a new rating, along with an explanation of why it is believed to be the most accurate among those considered, does not strike me as a disruptive process.

It would be in the eye of the beholder whether the change is tangible or intangible.

You have not indicated whether your idea has been put to a practical test. Has it?

This type of reply seems dismissive, by suggesting that delivering code is necessary or preferable. However, it is valid to have a discussion about the mathematics of rating systems, while leaving the implementation with code as an independent issue to be dealt with after resolving the mathematical design.

This is the same cross-entropy loss suggested by @S_Alexander:

In fact, the update equations for Elo (and similarly for the related Glicko system) are highly related to cross-entropy loss!

See: ELO Ratings: It's not the player that improves - it's the estimate

Essentially, the update equation for adjusting Elo ratings is like making a gradient descent step to update the logistic loss (aka the binary cross-entropy loss).
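A small numerical sketch of that connection, under the assumption of the standard Elo expected-score formula (the K-factor and ratings here are made up for illustration): the gradient of the logistic loss with respect to a player's rating works out to (expected - actual) times a constant, so one gradient-descent step with a suitable learning rate reproduces the familiar K * (s - E) update.

```python
import math

def expected_score(r_a, r_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / 400.0))

def logistic_loss_grad(r_a, r_b, s):
    """Gradient of the binary cross-entropy loss with respect to r_a.

    With e = expected_score(r_a, r_b), the loss is
    -(s * ln(e) + (1 - s) * ln(1 - e)), and its derivative in r_a
    simplifies to (e - s) * ln(10) / 400.
    """
    e = expected_score(r_a, r_b)
    return (e - s) * math.log(10) / 400.0

# One gradient-descent step with learning rate K * 400 / ln(10)
# matches the familiar Elo update r_a + K * (s - e).
K = 32
r_a, r_b, s = 1800.0, 1600.0, 0.0  # the higher-rated player loses
lr = K * 400.0 / math.log(10)
gradient_step = r_a - lr * logistic_loss_grad(r_a, r_b, s)
elo_update = r_a + K * (s - expected_score(r_a, r_b))
print(gradient_step, elo_update)  # both about 1775.7
```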


Writing the rating code is only a small amount of the task of changing the rating system for a site.

I’m not sure what this observation adds to the conversation - it seems obvious.

What is equally obvious is that a person proposing a change needs to present it in a persuasive way so that enough beholders perceive it as tangibly beneficial. In fact, we could throw away the word tangible… as the proposer of a change, you need to present the proposal in a way that persuades the beholders to want it.

I was trying to answer your question of “why would we not change to this” - we would not change because no proposal has been presented that indicates why it would be worth the (very substantial) effort. That is the answer to your question.

This seems to have gotten tangled up in words. Seriously: it’s hard to successfully deploy a change to a rating system - it is disruptive. Why would we bother?

You seem excited about this efficiency thing, but what benefits does it bring?


I was not making a proposal and I am not going to do that. I wanted merely to draw attention to a possibility.

Wherever a rating system is used, the public being served is better served if the rating system can be shown to be the best possible in some sense.

It is up to the OGS powers that be to follow up if they want to. I can provide more detailed information if they are interested. If they are not interested, I will not be taking any further action.


That’s all good - it’s good to be aware of possibilities.

You did ask “why don’t we change?”, and I did answer that: such a change is expensive and would have to be worth the effort.

You didn’t explain (at least in layman’s terms) what benefit anyone would get from such a change.

That would leave it on the shelf until this benefit was understood - some abstract concept of “best” is not helpful.

That’s not true. It’s not sufficient to be “best in some sense”. The community is better served by devoting resources to making changes that have well-quantified benefits with a good return on investment.

I think it would be great to learn what “more efficient rating system” even means…

As a user of the rating system, I care about:

  • Stability - does the response to each result feel proportionate, or does the rating vary wildly with every new result?
  • Accuracy - does the comparison of rating result in good predictions about who will win, and how easily
  • Speed of convergence - how many results does it take to get to a stable and accurate rating

I think those are the main things that seem to matter.


I’m not going to dig through the extraordinarily long threads to find it, but Anoek posted a statistical analysis on the performance of the current system (or a minor variant of it) a while ago, and it was rather good from an accuracy and stability perspective.


Is it this post?


Rating update windows were removed again in the 2021 rating system update.


I think most people care about the things you mention.
Delivering these features is far from straightforward. For example, while stability is a virtue, the system should ideally detect rapid improvers and allow them to reach their stable rating more quickly. But such refinements should not be done on the basis of guesswork. An optimal way to process rapid improvers should be introduced, and optimization requires a calculable way to assess system efficiency. It is possible to have that as a tool.