Is rating volatility a bug or a feature? [Forked]

I don’t quite understand Glicko-2, but my impression is that the deviation part is a small modification to a system that mostly works without it. And a small modification that is almost the same for everyone is going to be doubly insignificant.

Looking at the equations, there’s this key factor g:

g(RD) = 1 / sqrt(1 + 3 q² RD² / π²),   with q = ln(10) / 400

So my g is 0.980 and Sofia’s is 0.973. Is that really a useful distinction to make between us?

If I’m doing this right, 350 for a new player gives a g of 0.667. Is that small enough to make a significant difference to anything in those first few games before it settles down to ~65 for everyone? Does anyone ever get a high deviation after the first few games?
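
For anyone who wants to play with the numbers, here's a quick sketch of that g factor in Python. The RDs of roughly 65 and 75 are my guesses for the deviations behind the 0.980 and 0.973 figures above, not values I've confirmed:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Weighting factor g(RD) from the Glicko formulas."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

for rd in (65, 75, 350):
    print(rd, round(g(rd), 3))
# roughly: 65 -> 0.979, 75 -> 0.973, 350 -> 0.669
```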

…If it produces better predictions… yes? Rating systems are often competing for small edges in prediction accuracy, as evidenced by this chart from the Whole History Rating paper:


[chart: prediction-accuracy comparison from the WHR paper, computed on KGS data no less]

You'll notice that Glicko-1 correctly predicts around 0.4 percentage points more games than Elo on the training data, which isn't much, but the added cost relative to Elo in terms of time isn't much either (in absolute terms).

Is it significant? Maybe. It appears to be consistently enough better to be accepted as “more accurate”, but I haven’t seen any p-values thrown around.

1 Like

No, except for a few edge cases.
With the one-game-at-a-time rating update, the RD depends only on your previous RD and the rating difference between the two players.
The edge cases are players (bots) who mostly play against much stronger or weaker opponents.


With more games in a rating period, one could reduce the RD from 60 to 30. Rating change is quadratic in RD, so we would get a far more stable rating, at the cost of slower adjustments to changes in player strength.
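
To put a rough number on “quadratic in RD” (this is my own arithmetic, using the single-game Glicko-1 update as an approximation, not anything taken from the OGS code): the per-game rating change is roughly Δr ≈ q · g(RD_opp) · RD² · (s − E), with q = ln(10)/400. For an even game (s − E = ±0.5), an RD of 60 moves the rating by about 10 points, while an RD of 30 moves it by about 2.5, i.e. roughly a quarter as much.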


By the way, to get rating periods with ~15 games, a rating period would need a length of between ~2 weeks and a month.

3 Likes

By the way, I never apologized to @Uberdude for getting annoyed at his contribution.

Re-reading his reply in hindsight, I realize I hadn’t even understood exactly what he had said. Though I’m still not sure exactly what he intended to do, I don’t think I should have assumed “malicious” intent, so I apologize.

1 Like

Uuuh, @meili_yinhua, you might be interested in this?

@flovo, do you know what’s happening here?

I know I’m coming into the middle of a long thread, and I haven’t read and fully comprehended everything that has been written so far (I have skimmed, but it’s long!), so I apologize if I repeat what’s already been stated. Regardless, I’ve spent some time thinking about these problems, and have some mathematical training, so I hope I can input something useful.

My preferred approach to design or evaluate a rating system would be as follows.

  1. Build (or choose, since this has been done in the papers) a probabilistic model that attempts to account for things like the probability of game outcomes given strength differences, the likelihood of wild swings in player strength, etc. This model would probably come with some tweakable global parameters.
  2. Evaluate a system (or a proposed system) in terms of the log-likelihood of the observed data (using the values of the global parameters that give the highest log-likelihood). There are at least two sources of likelihood here. First, the rating system assigns ratings to players that change over time, and the probabilistic model assigns some quantitative likelihood to these fluctuations. Second, there are game outcomes between players to whom the rating system has assigned a strength, and the probabilistic model assigns probabilities to those game outcomes.
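
A minimal sketch of what step 2 could look like in Python, under two modelling assumptions that are mine for illustration (a logistic, Elo-style win probability and a Gaussian random walk for strength changes); every name and parameter here is made up, not an existing API:

```python
import math

def elo_win_prob(r_winner, r_loser):
    """Win probability for the first player under a logistic (Elo-style) curve."""
    return 1 / (1 + 10 ** ((r_loser - r_winner) / 400))

def log_likelihood(games, rating_histories, drift_sigma):
    """Score a rating system on observed data (sketch only).

    games            -- list of (winner_rating, loser_rating) pairs, using the
                        ratings the system had assigned at game time
    rating_histories -- per player, the sequence of ratings the system produced
    drift_sigma      -- the model's assumed per-step strength drift
    """
    ll = 0.0
    # 1) How surprising are the game results, given the assigned ratings?
    for winner_rating, loser_rating in games:
        ll += math.log(elo_win_prob(winner_rating, loser_rating))
    # 2) How surprising are the rating fluctuations themselves, under a simple
    #    Gaussian random-walk model of strength changes?
    for history in rating_histories:
        for prev, curr in zip(history, history[1:]):
            z = (curr - prev) / drift_sigma
            ll += -0.5 * z * z - math.log(drift_sigma * math.sqrt(2 * math.pi))
    return ll  # closer to 0 is better; -ll / math.log(2) is "bits of surprise"
```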

The log-likelihood from step 2 is likely a negative number, and its negation (after multiplying by -1 to make it positive) measures information (in bits, say, if the log is base 2). We can think of this number as a quantitative measure of how much the rating system is “surprised” by the ratings and results data. (It is related to a lower bound on the number of bits needed to store the ratings-and-results data in a perfect data compression scheme, so it does actually measure some real-world quantity, but that’s not very important here.)

I like this approach to thinking about the problem, because it lets us think about ratings volatility intuitively as well as quantitatively. For example, consider @Uberdude’s hypothetical rating system that perfectly assigns either 30k or 9d to a player based on whether they will win their next game. If we have a system like that and want to store a record of the ratings-and-results data, we don’t have to record the actual game results, only the ratings, because the results are a foregone conclusion and implied by the ratings. There is one bit’s worth of surprise per game, which is whether black or white was the 9d player for that game; everything else follows.

However, that’s terribly inefficient: it’s the same number of bits needed to store the results if we had no rating system at all and considered each game a toss-up: one bit per game, to record the winner. This matches the intuition that such a system, while it perfectly predicts the game outcomes, isn’t a useful rating system.

A good rating system is a tradeoff, in this precise quantitative way: ratings that don’t change too much over time, so that rating changes are predictable and therefore relatively “unsurprising” on average, and accurate game-outcome predictions, so that the game results themselves are unsurprising. It thus has the power to partially explain the observed results, which I think is part of what we appreciate in a good rating system (“Is this 4-win streak just a run of good luck, or have I gotten better?”).


TLDR: I vote that log-likelihood of observed results, in a model that accounts for both the likelihood of changes in strength and the probability of game outcomes, is the proper way to evaluate a rating system. My intuition is that OGS volatility is too high, if only because it doesn’t distinguish between games played back-to-back and games played after a long break, like a good model would. Consequently, I hypothesize that adjusting the system in a way that treats the volatility number as “ratings variance per wall-clock time” instead of “ratings variance per game played” would improve the situation at least a little bit.
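
To make that last proposal concrete, here is a rough sketch of the kind of change I mean. The function name, the `days_per_period` parameter and the whole framing are mine, not anything from the OGS code; only the pre-game step φ* = sqrt(φ² + σ²) is from the Glicko-2 description:

```python
import math

def inflate_deviation(phi, sigma, days_since_last_game, days_per_period=1.0):
    """Pre-game deviation inflation scaled by elapsed wall-clock time,
    instead of the fixed 'one game = one period' step.

    phi   -- the player's current rating deviation (Glicko-2 scale)
    sigma -- the player's volatility
    """
    periods_elapsed = days_since_last_game / days_per_period
    return math.sqrt(phi ** 2 + (sigma ** 2) * periods_elapsed)
```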


PS: I also have a sneaking suspicion, though I haven’t yet looked into it in detail, that the way the volatility numbers are updated doesn’t make sense for the way we are using Glicko. That is, I think they are supposed to represent a statistical estimate of the player’s actual underlying strength volatility, but with a period of one game, it should not be possible to estimate this.

In a real-world tournament where a player plays 15 games, say, you can use those games to get a very solid estimate of their current strength. You can look at how that strength has changed, or not, since their last rating, and compare that to the current estimate of their volatility. If there has been a big change, compared to what you’d expect from the volatility, you revise the estimate of the volatility upward, and vice versa.

On the other hand, with a one-game period, consider a player that only ever plays evenly rated games. All the system knows is:

  1. Their estimated strength, which is the same as their opponent’s.
  2. The statistical confidence in that strength.
  3. The estimate of their underlying strength volatility.
  4. Whether they won the game.

You revise the strength upward or downward, based on whether they won or lost, by an amount that takes into account the statistical confidence in that strength. You increase the statistical confidence in the strength because you have another data point, and also decrease it because time has passed (taking into account the estimate of the underlying strength volatility).

If we assume the estimate of the volatility stabilizes at some point, the statistical confidence also stabilizes, at the point where the increase from the volatility matches the decrease from the extra data point. So you eventually wind up increasing or decreasing the strength estimate by a more-or-less fixed amount each game, where the size of the change basically reflects the estimate of the underlying strength volatility.

But the basis for estimating the underlying strength volatility in the first place could only possibly be the average amount by which the strength changes per game. It’s circular; there is no good way to estimate the volatility given the data available to the rating system the way it’s currently implemented.
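
To illustrate the feedback loop I’m describing, here is a toy simulation. It uses the simpler Glicko-1 update rather than Glicko-2, with a fixed per-game deviation inflation `c` standing in for the volatility term, so the specific numbers are made up; only the qualitative behaviour is the point:

```python
import math

Q = math.log(10) / 400

def g(rd):
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def one_even_game(rating, rd, won, opp_rd=60.0, c=10.0):
    """One Glicko-1-style update for an evenly rated game, preceded by a
    fixed RD inflation c (a stand-in for the volatility term)."""
    rd = min(math.sqrt(rd ** 2 + c ** 2), 350.0)    # time passes: confidence decays
    e = 0.5                                         # even game: expected score 1/2
    d2 = 1 / (Q ** 2 * g(opp_rd) ** 2 * e * (1 - e))
    new_rd = math.sqrt(1 / (1 / rd ** 2 + 1 / d2))  # new data point: confidence grows
    delta = Q * new_rd ** 2 * g(opp_rd) * ((1.0 if won else 0.0) - e)
    return rating + delta, new_rd

rating, rd = 1500.0, 350.0
for i in range(40):
    rating, rd = one_even_game(rating, rd, won=(i % 2 == 0))
    print(round(rating), round(rd))
# With these toy numbers, rd settles around 60 and the rating then just
# oscillates by roughly 10 points per game: the "fixed amount" described above.
```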

This is possibly another reason for the perceived (measured?) volatility of OGS ratings: the volatility numbers may just be a mathematical fluke from the particular choices of parameters for the system, and they are just too high compared to real player strength volatility. A multi-game period system would estimate this and correct (per-player, even), but the OGS version can’t. If we don’t want multiple-game periods, it might be better to statistically estimate the real volatility using the available data, and peg player volatilities to that value instead of using the update formula. I think this is basically how Glicko-1 works.

2 Likes

I think this case is mostly about what is displayed. I believe OGS displays (in some situations) its estimate for how weak a player might be (“humble rank”; rating - 2 * rating deviation), instead of its actual best guess (the rating).

The actual rating for a new player is around 6k. When they play and lose to 1d players, it can’t actually lower their rank that much, because a 6k player is expected to lose to dan players! However, just by having more data points, the rating deviation decreases.

This is indeed a statistical fluke: the “real” confidence interval for the player should be two-sided, with one deviation for how much stronger they might be than their rating, and another deviation for how much weaker they might be than their rating. In the simplified model of the Glicko(-2) system, there is just one number, though. Having lost to 1d players, they’re probably not that much stronger than 6k, so their rating deviation comes down. Because of the simplifying assumption that this one deviation represents how much stronger or weaker they might be than their rating, the system’s estimate for how weak they might be rises, and this is what OGS displays sometimes. Again, the system’s actual estimate for their strength is decreasing.

This situation will correct itself, with their rating falling rapidly, if they start losing games to weaker people. Remember, if all we know about someone is that they’ve lost 20 even games to 1d players, they may very well actually be 6k.

2 Likes

Holy, you’re right :neutral_face: I hadn’t realized the confidence interval was one-sided; I thought it was just a graphical glitch! (Most likely because, as you say, there’s a “true rank” at 6 kyu at the center of the interval.)

(I’ll read your other message when I have more time; in the meantime, thanks for your contribution and for stating outright that you hadn’t read everything, I appreciate it :slight_smile:)

1 Like

This is a great explanation and the first time I’ve actually understood this “humble rank” concept. (But I think there’s no factor of two.) I also dug up what looks like the original proposal explaining what it’s supposed to be:

So I guess all the matchmaking parts are broken now, leaving just the display, which is only in one or two places anyway.

1 Like

Yeah, humble rank is an arcane bit of knowledge that I don’t pay much attention to but have to explain every now and then. It’s a weird middle-of-the-road solution, born of some very strongly held opinions about how Glicko is “supposed” to have a certain middle-of-the-road insertion position (when really that is just the result of assuming we have no information for a better-informed “initial” ratings distribution).

4 Likes

IIRC the question of humble rank came up in the 2021 discussions, and it was said that humble rank was broken. I believe that GreenAsJade/Eugene, who introduced humble rank to OGS, agreed that it was broken. What, if anything, has happened to it since that time, I don’t know.

1 Like

I was just now trying to track down my “factor of two” confusion, and now I’m confused again about humble rank. I went to the page of a player who has played no rated games, and their rank in the display area is “1150 ± 350” / “11.9k ± 4.9”. However, in the list of games, it displays as “6k”.

I had thought (and said, above) that 6k was the center estimate. Is it actually 12k? If so, what is the 6k in the games list? It doesn’t seem to be 11.9 - 4.9 = 7k plus some rounding error. For one, that would mean it’s displaying some kind of “boastful rank,” rather than humble rank, in the games list, which seems weird. Also, my rank is “9.0k ± 1.1”, which would place my “boastful rank” at something like 7.9k, but it displays in the same game lists as 9.0k [correction: 9k].

So it does seem like there is some weird display bug somewhere. My explanation for @espoojaram’s recent example only makes sense if the 11.9k was the humble rank but, if so, why does it display as “11.9k ± 4.9” on the profile page? That clearly seems to indicate 11.9k as the central estimate, not the humble rank. But even then, if it is displaying “[rating - deviation] ± [deviation]” on the profile and “[rating]” on the games list, then why does my rank display as “9.0k ± 1.1” on my profile and “9k” in the games list? I’m pretty sure it wasn’t until my actual rank was “9.0” that it displayed that way, either, having rounded up to “10k” even when I was 9.2 on the profile.

I’m now much more confused than I was yesterday.

[Regarding the “factor of two” confusion: that came from the Glicko-2 example PDF, which suggests displaying an approximate 95% confidence interval as ± 2 × RD. I had assumed that was what OGS implemented.]

[Retracted; the relevant discussion is in another thread.] Question: without getting into display bugs, does anyone here know, outright, whether the initial rating (the actual rating, for the Glicko algorithm) we use here is 1150, or is it 1500?

2 Likes

We just had a discussion about this, starting here: Are OGS rankings inflated, deflated or neither? - #13 by Conrad_Melville

1 Like

This is the problem with all this stuff being split between two different threads.

1 Like

To answer your question, I no longer know. It looks like we are back to 11.9k as the starting pseudo-rank, but it is unclear whether the 6k on the thumbnails is a “leftover” (therefore a bug) or if it is used for matching.

FYI, all ranks are rounded up for anything over X.0

1 Like

Hmmm. My conjecture/interpretation, when you pointed that out, was that the rating plotted out in the profile page is a weighted average, or interpolation, of the “low boundary of the 1σ-confidence interval” (the “humble rank”) and the current Glicko-2 rating estimate, and basically moves from 100% humble rank at the beginning to 100% Glicko-2, as a function of the deviation itself.

Since ratings are considered “accurate” when the deviation reaches 160, I would guess the parameter for this interpolation slides as a function of the interval [350,160].

Something like this:
displayed = t × humble + (1 − t) × rating,   where humble = rating − deviation and t = (deviation − 160) / (350 − 160)
(sigh, I wish this platform allowed for math equations)

(EDIT: I realized the formula simplifies to this :sweat_smile:)
displayed = rating − deviation × (deviation − 160) / 190

In the above formula I’m assuming that the Deviation starts out at 350 like it’s displayed in the graphs.
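
As a quick sanity check of that conjecture at the endpoints (using my reconstruction of the formula above): at a deviation of 350 the interpolation weight is 1, so the plotted value would be the pure humble rank, 1500 − 350 = 1150; at a deviation of 160 the weight is 0, so it would be the pure Glicko-2 rating of 1500; and at a deviation of 255 it would sit halfway, 1500 − 255/2 ≈ 1372.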

But even if that formula is right, it doesn’t answer why the deviation is displayed as 4.9. (That it’s expressed as “11.9 ± 4.9” doesn’t surprise me; after all, that’s what a lot of scientific research does too: write down a simplified symmetrical uncertainty even though the uncertainty is usually not symmetrical, and the quantity we’re talking about here is essentially entirely made up.)

I’d say that the true provisional rating being 1150 wouldn’t make sense given the example I brought up above (losing three games and gaining rating each time).

Also, note that 1150 = 1500 − 350, whereas 11.9 ≠ 6 + 4.9.

My guess is that the deviation is being converted into a “rank-like” quantity through the use of the usual formula, which is log-based, so it ends up doing this weird effect of shrinking the deviation.
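
If I’m remembering the conversion right, that guess checks out numerically. The constants 525 and 23.15 below are from my recollection of the goratings documentation, so treat them as an assumption, but they do reproduce both the ~6k and the 11.9k figures:

```python
import math

def rating_to_rank(rating):
    """Log-based rating-to-rank conversion (constants are my recollection of
    the goratings docs, not verified here); a rank of 30.0 corresponds to 1 dan."""
    return math.log(rating / 525) * 23.15

def to_kyu(rank):
    return 30 - rank

print(to_kyu(rating_to_rank(1500)))        # ~5.7, i.e. the "6k" centre estimate
print(to_kyu(rating_to_rank(1500 - 350)))  # ~11.9, i.e. the humble-rank display
# Because the conversion is logarithmic, a 350-point deviation is not a fixed
# number of rank units: it spans ~6.2 ranks below 1500 but only ~4.9 above it.
print(rating_to_rank(1500) - rating_to_rank(1150))  # ~6.2
print(rating_to_rank(1850) - rating_to_rank(1500))  # ~4.9
```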


Well, this is a very technical thread, and the main topic of this thread is actually to devise a test to verify once and for all whether the current rating volatility is hurting the handicap system and the matchmaking.

2 Likes

I plan to start looking into the ratings issues in more detail. To bump this earlier question, because I don’t see it answered: do we know whether the Python code found in the goratings GitHub repository is the actual production code (and therefore definitely reflects what OGS actually does), or is it exclusively algorithm-development code?

The README on GitHub leaves it ambiguous, in my reading.

1 Like

While that question went unanswered here, this reply above from a co-developer seems to imply that it is.


Oh, I agree :laughing:

1 Like

@paisley:
I’ve finally gotten around to reading your first reply to this thread :laughing:

Given how long it appeared to be, it was a surprisingly painless experience; even for a quasi-layman like me, it was pretty understandable.

Well, I don’t know what log-likelihood is, but I’ll take you at your word that it might be a good measure for evaluating a rating system.

Now, I think the best way to perform the test is to implement all of the decent proposed methodologies and get many different kinds of evaluation. If they are good measures and there’s a clear-cut answer to be found, they should mostly agree, and if they don’t agree, that will tell us something weird is going on. And anyway, the decision will be left to the readers/peer-reviewers as to whether any of the systems are good.

So what I mean is: for what it’s worth, if it comes down to me I’ll do my best to understand the method you proposed thoroughly and implement it. (Hopefully people more proficient than me can collaborate though :laughing:)

I believe the point you bring up about the volatility (that it should be impossible to update it with one-game rating periods) was vaguely referenced in this reply and then again in this reply; not much, but it might help you. It would also have been impossible to get specific without seeing the code, which you seem to be working on (assuming that the goratings repository is the code used right now).

Before reading it, I was actually planning on giving you some advice on how to format and lay out a long message to make it more readable, but to be honest the vibe I’m getting in the last few days is that I have alienated most of this forum’s community with my long replies, so I guess my advice is: don’t write long replies? :laughing:

You haven’t alienated me. I enjoy reading your posts, but I didn’t spend much time on my computer or phone during the past week with Christmas dining and some outings (enjoying a week off).

3 Likes