Is rating volatility a bug or a feature? [Forked]

Regarding normal distributions:

In my opinion, normal distributions do have more going for them than their mathematical convenience and the fact that they’ve been well studied. It’s not just an arbitrary choice of bell curve: the Central Limit Theorem says that if a large random effect is the sum of many small, independent effects, then it will specifically look like a normal distribution. So there is often theoretical justification for choosing the normal distribution over other bell curves, although I agree it’s sometimes overstated.
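
To make that concrete, here’s a minimal toy sketch (my own illustration, nothing to do with Glicko): each sample below is a sum of 100 small uniform effects, and the resulting distribution comes out looking approximately normal even though the individual effects are flat, not bell-shaped.

```python
# Toy sketch of the Central Limit Theorem: summing many small,
# independent effects produces an approximately normal distribution.
import random
import statistics

def sum_of_small_effects(n_effects=100):
    # Each effect is uniform on [-1, 1]: decidedly not bell-shaped on its own.
    return sum(random.uniform(-1, 1) for _ in range(n_effects))

samples = [sum_of_small_effects() for _ in range(10_000)]
print(statistics.mean(samples))   # close to 0
print(statistics.stdev(samples))  # close to sqrt(100 * 1/3) ≈ 5.77
```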

Also, I think there is something to the mathematical convenience, beyond just “it’s convenient.” It’s a simplifying assumption that discards higher-order effects when they might not be relevant (or when they might be smaller than other sources of error). Kind of like an engineer or physicist deciding to treat their mechanical springs as ideal, linear springs. The log-likelihood of a normal distribution (our good friend!) is a parabola: a quadratic curve. So deciding to approximate a distribution as normal is akin to deciding to approximate a more complicated function with a simple, low-degree polynomial. This approximation will be better if the data points you are interested in are closer to the mean, rather than out near the tails. (If you’ve studied calculus, remember that there is a formal basis for this approximation via Taylor’s Theorem.)
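
If it helps, here’s that point in a few lines of Python (my own illustration, with arbitrary parameters): the log-density of a normal distribution is literally a quadratic polynomial in x.

```python
# Sketch: the log-density of a normal distribution is exactly a downward
# parabola in x, so "assume it's normal" amounts to fitting a quadratic
# to the log-likelihood.
import math

def normal_logpdf(x, mu=0.0, sigma=1.0):
    return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

def parabola(x, mu=0.0, sigma=1.0):
    # The same function written as a plain quadratic a*x^2 + b*x + c.
    a = -1.0 / (2 * sigma ** 2)
    b = mu / sigma ** 2
    c = -mu ** 2 / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))
    return a * x ** 2 + b * x + c

for x in (-2.0, -0.5, 0.0, 1.3):
    print(normal_logpdf(x, 1.0, 0.5), parabola(x, 1.0, 0.5))  # identical values
```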

So, IMO, there’s good reason to treat the normal distribution as distinguished among the other bell curves, not just a kind of intellectual laziness.

Regarding Glicko's assumptions:

Meili_yinhua’s description of Glicko(-2)'s use of the normal/Gaussian distribution agrees with my understanding. I’d like to clarify a few points further, because I think it might be enlightening.

Glicko and similar systems, as far as I understand them, use the normal/Gaussian distribution in two different ways. First, the underlying model assumes that players’ strengths vary randomly following a Brownian motion pattern, which is like a continuous-time version of a normal distribution. Again, this is an assumption of the underlying model; the system assumes that the world actually looks like this (or at least, close enough that the assumption will yield useful results).
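
A tiny sketch of what that first assumption looks like (purely illustrative; the weekly step size is made up, not a Glicko parameter):

```python
# Toy sketch of the first assumption: a player's underlying strength
# drifts as a random walk (discrete-time Brownian motion) over time.
import random

strength = 1500.0
history = []
for week in range(52):
    strength += random.gauss(0.0, 15.0)  # small Gaussian drift each week
    history.append(strength)
print(history[:5], "...", history[-1])
```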

Second, it makes a computational simplification by approximating the information we have about a player’s strength at any point in time by a normal/Gaussian distribution. True Bayesian inference from game results yields a complicated, awful distribution that, as far as I know, can’t really be parameterized with a fixed number of parameters. The system summarizes this information with a kind of best-fit normal/Gaussian distribution, and discards the additional information. This discarding of information causes, for example, the phenomenon we discussed earlier where a new player can lose to 1d players and have their humble rank rise.
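
Here’s a toy sketch of that second use, just to make the idea concrete. This is not Glicko’s actual update formula: the function names and the grid-based moment matching are my own simplifications, and I use a plain logistic win probability for the likelihood.

```python
# Toy sketch: after a game, the true posterior over a player's strength is
# messy, so summarize it by a best-fit normal (here via simple moment
# matching on a grid) and discard everything else.
import math

def logistic_win_prob(strength_diff):
    # Model assumption: P(win) is a logistic function of the strength difference.
    return 1.0 / (1.0 + math.exp(-strength_diff))

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_as_gaussian(prior_mu, prior_sigma, opponent_strength, won):
    # Grid over plausible strengths (prior mean ± 5 deviations).
    xs = [prior_mu + prior_sigma * (i / 100.0) for i in range(-500, 501)]
    weights = []
    for x in xs:
        p = logistic_win_prob(x - opponent_strength)
        likelihood = p if won else (1.0 - p)
        weights.append(likelihood * gaussian_pdf(x, prior_mu, prior_sigma))
    total = sum(weights)
    mean = sum(x * w for x, w in zip(xs, weights)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, weights)) / total
    return mean, math.sqrt(var)  # the normal summary; the rest is discarded

# A win against a stronger opponent pulls the mean up and shrinks the deviation.
print(posterior_as_gaussian(prior_mu=0.0, prior_sigma=2.0, opponent_strength=1.0, won=True))
```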

The “interpolation” you talk about, @espoojaram, is the Bayesian inference that Meili_yinhua is talking about, combined with the information-discarding summary of the result as a normal/Gaussian distribution.

This isn’t right. The Brownian motion is Gaussian, but it models how a player’s innate strength changes over time. The model assumes that game-to-game fluctuations in performance, against an opponent subject to similar fluctuations, follow a sort of logistic distribution: the probability of winning is a logistic function of the difference in strengths. In other words, a particular difference in strengths corresponds to a particular probability for the game outcome, and that probability already accounts for the fact that the skill a player showcases on any given day is random. The thing that’s moving via Brownian motion is the probability of having a good day, not how good a day they’re having, so to speak.
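
For concreteness, here is the familiar Elo-scale form of that logistic assumption (the 400 is the conventional Elo scale constant, not anything specific to OGS):

```python
# Expected score (win probability) as a logistic function of the rating difference.
def expected_score(r_player, r_opponent):
    return 1.0 / (1.0 + 10.0 ** (-(r_player - r_opponent) / 400.0))

print(expected_score(1800, 1800))  # 0.5: equal strengths, even odds
print(expected_score(1800, 1600))  # ≈ 0.76: a 200-point favorite
print(expected_score(1600, 1800))  # ≈ 0.24: the complementary probability
```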

The rating deviation is an estimate of our uncertainty about a player’s underlying strength, not our uncertainty about how they will perform in a given game. There is no separate number for that uncertainty; it’s all contained in the single number we call “strength.”

My Wikipedia-based knowledge of the history of rating systems is that Elo’s original model assumed that the “strength a player might showcase on any given day” followed a normal/Gaussian distribution, too, but this proved not to be as good a fit for the data as the logistic function.

2 Likes

I think it is reasonable to assume a normal distribution for fluctuations in a given player’s strength.

But the ratings of a large population of players are less likely to follow a normal distribution (even approximately). There are far more very weak players than very strong players.

1 Like

True, but the vast majority of those weak players don’t play much go at all, let alone on OGS.

Edit: Also, weak players may not remain weak very long, compared to strong players remaining strong for a long time.

As for histograms of the strength distribution on OGS: Unofficial OGS rank histogram (and graphs) 2022

If I remember my papers correctly, I believe Glickman’s Ph.D. thesis actually mentions the option of using a Gaussian distribution instead of a logistic one, but it’s very computationally expensive to incorporate into a ratings system, and the logistic function provides a good approximation to it.

EDIT: It just occurred to me that what you might be thinking of is the E function using Euler’s number e for the exponentials instead of 10, which unfortunately does not make it based on a Gaussian distribution. To make one based on a normal distribution, you would need a nasty integral (the “error function”) that can’t be expressed in closed form using the “basic functions” (+, -, *, /, ^, √).
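
To illustrate the contrast (my own snippet; the 1.702 is the usual scaling factor that makes the two curves line up, not anything from Glicko’s papers): the logistic E is elementary, while the Gaussian version needs the error function.

```python
# Logistic CDF vs. normal CDF: the logistic one needs only elementary
# operations, the Gaussian one needs erf (no closed form in +, -, *, /, ^, √).
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

for d in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(d, logistic_cdf(1.702 * d), normal_cdf(d))  # agree to within ~0.01
```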

1 Like

I was not thinking of a version of the E function using Euler’s number. That is just a different choice of scale anyway, as I’m sure you know, because any function of the form 10^(x/r) can be written as e^(x/s) for some s, and vice versa. I think we were thinking of the same thing: using an E function which is the CDF of a normal distribution instead of a logistic function (i.e., the CDF of a logistic distribution). And yeah, the CDF of a normal distribution is nasty.
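
A quick numeric check of that rescaling, in case anyone wants to see it (r = 400 is the familiar Elo scale; s is just the derived constant, which incidentally comes out near the 173.72 Glicko-2 uses for its internal scale):

```python
# 10**(x/r) equals e**(x/s) whenever s = r / ln(10).
import math

r = 400.0
s = r / math.log(10)  # ≈ 173.72
for x in (-300.0, 0.0, 150.0, 612.0):
    print(10 ** (x / r), math.exp(x / s))  # same value, up to float rounding
```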

I believed it wasn’t just a matter of convenience, although my source was just Wikipedia, as I said. The Wikipedia article on the Elo system contains the following paragraphs scattered throughout:

Wikipedia excerpts

Elo’s central assumption was that the chess performance of each player in each game is a normally distributed random variable.

Subsequent statistical tests have suggested that chess performance is almost certainly not distributed as a normal distribution, as weaker players have greater winning chances than Elo’s model predicts.[9][10] In practice, there is little difference between the shape of the logistic and normal curve. So it does not matter whether the logistic or normal distribution is used to calculate the expected scores.[11] Mathematically, however, the logistic function is more convenient to work with.[12] FIDE continues to use the rating difference table as proposed by Elo[13]: table 8.1b.

The first mathematical concern addressed by the USCF was the use of the normal distribution. They found that this did not accurately represent the actual results achieved, particularly by the lower rated players. Instead they switched to a logistic distribution model, which the USCF found provided a better fit for the actual results achieved.[28][citation needed] FIDE also uses an approximation to the logistic distribution.[13]

I’m not sure how much I believe Wikipedia here, though, after further review. At least one of the linked citations that I chased doesn’t really support the text, and the second paragraph seems to contradict the third regarding FIDE’s current practice.

Edit: After chasing a few more of Wikipedia’s citations, I can’t find a single one that supports the claim that the logistic distribution better fits the data. I did find that a long article by Glickman describes the distinction as “empirically…a moot issue.” (First full paragraph on page 6.) So yes, I do now believe it’s merely a matter of mathematical convenience, although a nontrivial one.

1 Like

Since this thread was started from a fork and it’s gotten quite long and convoluted, I created a follow-up where I start with a summary of this one.

This way, hopefully more people on the OGS forum will have the opportunity to jump in with an understanding of what we’re talking about and be more easily able to contribute should they want to.

In the first reply to that thread I also start talking about the doubts I’ve been having in the last few days. Please join the discussion there :slight_smile: