Is rating volatility a bug or a feature? [Forked]

The question of which opponent would give the most interesting game is more difficult than the question of which opponent would give a winning probability as close to 50% as possible according to the rating system's prediction. It adds the question of what is interesting to both players, which is subjective and might differ between them.

I guess the main issue with any kind of rating system is that people desire different things from it.

New players / beginners usually want a rating system which adjusts quickly to their development on a game-by-game basis, while more "established" players prefer more stable ranks which they can use as part of their identity.

Also, some people hope that their online rank matches closely with their so-called "real" offline AGA/EGF rank, while others only play online and don't care at all how ranks here compare to other rating systems.

And of course some people play tens of games a day while others play just a few in any given month, but they still may (or may not) share the same expectations about how volatile their rank should be.

(Oh, and besides those, there are always some sandbaggers who prefer a lower-than-deserved rank in order to play against players they can beat with ease…)

All rating systems are compromises. I honestly like the current OGS implementation a lot better than the older pre-2017 Elo-based ranks, or even the pre-2021 ranks, but I belong to the "stagnated old farts" group who haven't gotten any better for years.

6 Likes

No, I mean, creating automatch filters is obviously possible for the devs, but the players wouldn't have any control over them unless granted by the devs, if the calculation happens on the server side.

You started by saying

and there’s no reason I can see why devs should grant players this ability.


Here, you conveniently took my quote out of context and removed the portion where I apply caveats and propose solutions (which, unless validly argued otherwise, I believe would work) to prevent this farming.

Of course I do agree that we need to be careful about this when thinking of new rating systems, and I'll definitely admit this was a blind spot I had during the course of this conversation; I wasn't really thinking about farming at all :sweat_smile: So thank you for bringing it up, and thank you for providing examples, which I'll try to consider carefully.


Yeah, sure, but until we solve this philosophical problem, the 50% probability is our best measure for “interesting” games, or at least it’s the one that has been chosen in developing rating systems so far, apparently. If you can offer a better measure, I for one am listening :laughing:

1 Like

I honestly think the OGS system is already pretty good at both of these things, or it would be if the displayed rank was a (say it with me now) smoothed version of the volatile rating! :laughing: (because if there’s a true upward trend, a smoothed version just takes a few games to catch up, and I’d say fast fluctuations just give newbies false hopes)

This is another problem. For the long-term future, I think the only solution would be to develop open source rating software and, in a dream world, have it adopted by all major Go hubs. For the short term, as long as rating systems are population-dependent, I think the only thing we could and should do is essentially run awareness campaigns about this concept, and perhaps we might consider ditching the kyu ranking system altogether and just periodically publish statistics estimating how the OGS rating relates to the rankings of other associations.

1 Like

True. An example of a rating system feature that was dropped, perhaps because it conflicted with people’s expectations:

IIRC the 15 game sliding window was dropped in the 2021 rating system update.
That sliding window confused many players when their rating went up after a loss or down after a win. The cause is of course that their latest game result would push an older game result out of the sliding window, and that older result might happen to have a greater and opposite effect on their rating than adding their latest result to the window.
I think that avoiding this confusion and questions about it was an important reason to drop the 15 game sliding window. I don’t know if this (unintentionally) increased rating volatility, but intuitively I think it might have.
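To illustrate the basic mechanism with a deliberately oversimplified toy model (a plain moving average over a 3-game window, nothing like the actual Glicko-2 machinery, and with made-up per-game performance numbers): a loss can still raise the average whenever the result it pushes out of the window was even worse.

```python
# Toy sketch only: a "rating" that is just the average of hypothetical
# per-game performance estimates in a sliding window (NOT the real OGS update).
from collections import deque

window = deque([900, 1500, 1480], maxlen=3)  # made-up performance estimates
print(f"rating before: {sum(window) / len(window):.0f}")  # -> 1293

# The new game is a LOSS with an estimated performance of 1400, but it pushes
# the old 900 result out of the window, so the average still goes up.
window.append(1400)
print(f"rating after the loss: {sum(window) / len(window):.0f}")  # -> 1460
```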

So if a new rating system can have a similar effect of rating decrease after a win or rating increase after a loss, be prepared to answer many questions about it.

2 Likes

It was actually more complicated than "games dropped out of the window"; it was more like "you have 15 ratings that are all calculated with a one-game staggering of rating periods, those periods have very little correlation to time, and each time you play a game you move to the next of the 15 ratings involved."

The case is just most likely to occur when the game 15 games ago was a large upset, because of the different initial values in the "current" rating period of the system.

Keep in mind, the concept of a rating period is supposed to be that the games within it are treated as if played simultaneously, meaning the sliding window is constantly re-evaluating when each game happened.

Correction: by the paper, a "rating period" is defined as a unit of time; Glickman just recommends: "The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period. The length of time for a rating period is at the discretion of the administrator."

The use of "the last 15 games" is a method specifically chosen to fit the server's data given the tests the devs had run on it, combined with a typo in the volatility function formula that likely created very skewed results (anoek notes that fixing that typo was a major factor in dropping the "sliding window").

And this is a reasonable thing to have to answer, because any model that uses evidence to reach a conclusion contrary to the one the evidence would support is flawed unless it has some justification built in for reaching that conclusion. Unless winning is somehow an indicator that you are expected to be weaker than the model predicted before you won, a rating drop after a win means the model is either A) having made a mistake leading up to the previous iteration, or B) making a mistake in lowering it now.

2 Likes

This is one of the reasons I stress so much that volatility is a variable with time-dependent effects, and I will evidence this with a couple of things:

First, this formula from the standard description of the Glicko-2 implementation (Step 6):

φ* = √(φ² + σ′²)

which is a very simple-looking formula, because it underlies a simple concept,

and second, this formula from Glickman's "Parameter estimation in large dynamic paired comparison experiments" (referenced in the first paragraph of the Glicko-1 paper), from section 3.2, "Updating from the passage of time", which models the passage of time as:

θ(t) | θ(0) ∼ N(θ(0), v²t)

Now, you might wonder, how do these formulas relate? It's quite simple: the change in true rating (θ) over an interval of length t is modeled by a random walk, which has a variance of v²t, or, to put it another way, this random variable has a standard deviation of v√t.

That might still not make it perfectly clear what's going on, so let's imagine a situation where we know your true rating perfectly at time = 0. Then naturally, at time = 1, we no longer know it: it has moved randomly with a variance of v², and therefore a standard deviation of v. And let's say we still don't nail it down, or even get any information at all; then after time = 2, our understanding of your rating distribution is that it has a variance of 2v², and therefore a standard deviation of v√2.

Now if we look at that first formula again, φ* = √(φ² + σ′²), and remember that φ is the rating deviation and σ the volatility in Glicko-2's internal scale, then setting φ equal to the standard deviation of our example at time = 1 (that is, v), and σ² to that "increase in variance of … strength per unit of time" (so σ = v), unsurprisingly gives φ* = v√2.

Now, that may seem contrived, but there's actually a more general reason that this lines up: when you add two independent normally distributed random variables with means m₁, m₂ and variances σ₁², σ₂², the result is normally distributed with mean m₁ + m₂ and variance σ₁² + σ₂².

And, of course, if you take the square root of that variance to get the standard deviation, you obtain √(σ₁² + σ₂²), which is exactly the type of formula at play in Step 6 of Glicko-2!
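As a quick numerical check of that picture (a sketch with made-up numbers on Glicko-2's internal scale, not anything taken from OGS): applying the Step 6 update φ* = √(φ² + σ′²) once per unit of time, starting from a perfectly known rating, reproduces the random-walk spread v√t exactly.

```python
import math

v = 0.06   # assumed volatility, i.e. the square root of the variance added per unit time
phi = 0.0  # pretend we know the true rating exactly at t = 0

for t in range(1, 6):
    phi = math.sqrt(phi**2 + v**2)  # Glicko-2 Step 6: phi* = sqrt(phi^2 + sigma^2)
    print(f"t={t}: phi={phi:.4f}   v*sqrt(t)={v * math.sqrt(t):.4f}")
# The two columns match at every step: repeated variance addition is a random walk.
```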

1 Like

One thing that makes me very suspicious about our current system is that the displayed deviation is almost the same for every established player:

[four profile screenshots (2022-12-27) of established players, each showing nearly the same rating deviation]

How could a second-order effect that only varies by ~10-20% possibly be doing anything useful?

2 Likes

For starters, as you yourself pointed out, it's not true that the deviation varies only by 10-20%. The deviation starts out at 350; as I'm sure you've heard before, a rating is considered established once the deviation goes under 160; and the evidence you bring suggests that it settles around 70 for most well-established players. That's about a factor-of-5 (500%) change from the start.

As @meili_yinhua said at some point, part of the use (the main use?) of the deviation is that if a player has been away for a while, the deviation grows to capture potential changes in the player's skill while they were away. While a player keeps playing at a stable rate, there should be no need to grow the deviation, because no change to their skill is expected other than the changes that, according to the rating system, they exhibit routinely.
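For what it's worth, this is roughly how that "been away for a while" mechanism looks in the original Glicko-1 paper; the constant c, the starting RD and the "one rating period per month" framing below are all assumptions for illustration, and whether OGS actually applies this step is exactly what gets questioned a few posts down.

```python
import math

def inflate_rd(rd, periods_inactive, c=35.0, rd_max=350.0):
    """Glicko-1 style RD growth for inactivity: RD' = min(sqrt(RD^2 + c^2 * t), RD_max).
    c is a tunable constant; 35 is just an assumed value here."""
    return min(math.sqrt(rd**2 + c**2 * periods_inactive), rd_max)

rd = 70.0  # a typical "established" deviation
for months_away in (0, 3, 12, 36):  # treating one rating period as one month (arbitrary)
    print(f"{months_away:>2} months away -> RD {inflate_rd(rd, months_away):.0f}")
```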

You could ask, why does the deviation never seem to go under 60?

There’s a trade-off between stability and adaptability. A system that is very stable cannot adapt quickly to fast changes in a player’s skill.

Those fast changes might be due to the player simply having a good day or a good week, and the system adapting to that might make for better matchmaking and better handicap calculation, leading to more “fair” and “interesting” games with a winning probability as close as possible to 50% for both players.

Whether or not the fluctuations in the rating actually succeed in doing that is something we need to test. In this topic, we have been discussing the fact that we need to test it, what could be some good ways to test it, and which properties exactly we want in a rating system.

Now, I have to be honest, I’m not sure this really answers your doubts. What does “second-order effect” mean?

EDIT: By the way, consider this: if you measure a bunch of things with a ruler, and you’re pretty good with a ruler, your uncertainty should be ±0.5 mm or ±0.25 mm for all measurements. Does that mean the uncertainty is “not doing anything useful” because it’s the same in all cases? No, it’s equally useful in all those cases because it’s doing the same thing in all those cases.

Though I might agree that it's somewhat pointless to display the deviation, all the more so if it's similar in most cases, partly because most people don't really know how to interpret it, myself possibly included.

I’ve based part of my previous reply to @Feijoa on this, but I went to check, and I’m actually not sure this is currently implemented on OGS.

Just by looking at the graphs of players who have been away for weeks or months, even more than a year, I can see that the "uncertainty" doesn't seem to increase at all during or after these pauses. I couldn't find a single example of that happening. In some cases I've seen their ratings change after the pause, but not the uncertainty.


This is a player who stayed away for 3 years. I know that means the rating system has changed in the meantime, but judging by how their rank has dropped since returning, their initial rank was clearly unrepresentative of their current level in the current population, and yet the deviation is between 61 and 63 all the way through.

So unless I’m missing something, it might be that Glicko-2 is well-suited for this purpose, but OGS doesn’t seem to be taking advantage of it :thinking:

3 Likes

The key word is actually in the phrase you used in your statement: established players. Generally, RD tends to be smaller the more often you play, as we have more evidence as to where your results tend to fall, and volatilities are actually expected to be relatively similar for most players, so each player should approach a minimum RD for their frequency of play, which indicates a higher confidence that their strength lies in a smaller region.

Now, of course, this assumes, like espo notes, that this is being taken advantage of.

This is my fear with the implementation of instant-time rating periods… that it might forget the time aspect, which is what led to this point when I first mentioned the Lichess-style updating:

that you can't actually follow the normal Step 6 of the Glicko-2 algorithm in instant-time formulations, and actually need to add volatility²·time to make the update work effectively for such a system, since that time portion was normally accounted for at each rating-period step.

1 Like

I’d say this is another piece of evidence that the rating system isn’t doing what Glicko-2 is supposed to, although it might have something to do with the way some players play correspondence: among the players Feijoa showcased, Sofiam is the one that plays most frequently, and she’s the one with the highest deviation, on top of being noted for having a wildly fluctuating rank.

It seems the deviation on OGS works essentially like a usual sample standard deviation: when the sample has a more spread-out distribution, it's larger :laughing: In other words, instead of anticipating changes in skill, it only reacts to them after the fact.
As a pure guess, I might venture that this is part of the “calculations” you mentioned that were introduced when reducing the Glicko window to 1 game.

So I feel @Feijoa’s observation was on to something after all! Sorry for doubting you :laughing:

2 Likes

As a single anecdotal data point on OGS's rating volatility compared to EGF's, these are my rating graphs from both over a selected period covering a similar number of games (~140), albeit that the corresponding time period is about 2 years on OGS and about 7 years on EGF.

OGS, rating spread about 1.5 ranks:

EGF (red line), rating spread about 0.6 ranks:

Of course there may be other factors that affect rating volatility. My EGF games are IRL tournament games. Perhaps I perform more consistently in that setting. I may play less seriously in my internet games. On the other hand, my OGS games are mostly correspondence and I do take those fairly seriously too.

1 Like

I don’t quite understand Glicko-2, but my impression is that the deviation part is a small modification to a system that mostly works without it. And a small modification that is almost the same for everyone is going to be doubly insignificant.

Looking at the equations, there’s this key factor g:

g(φ) = 1/√(1 + 3φ²/π²), where φ = RD/173.7178 (the conversion to Glicko-2's internal scale)

So my g is 0.980 and Sofia’s is 0.973. Is that really a useful distinction to make between us?

If I’m doing this right, 350 for a new player gives a g of 0.667. Is that small enough to make a significant difference to anything in those first few games before it settles down to ~65 for everyone? Does anyone ever get a high deviation after the first few games?
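For reference, here is a small sketch of that g factor as defined in the Glicko-2 paper, to reproduce these numbers (the RD values 65 and 75 are rough readings of typical established deviations, so treat them as assumptions):

```python
import math

GLICKO2_SCALE = 173.7178  # conversion from rating points to Glicko-2's internal scale

def g(rd):
    """Glicko-2 weighting factor g(phi) = 1 / sqrt(1 + 3*phi^2 / pi^2)."""
    phi = rd / GLICKO2_SCALE
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)

for rd in (65, 75, 160, 350):
    print(f"RD {rd:>3}: g = {g(rd):.3f}")
# RD  65: g = 0.979   (close to the 0.980 above)
# RD  75: g = 0.973
# RD 160: g = 0.892
# RD 350: g = 0.669   (the ~0.667 quoted for a brand-new player)
```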

…If it produces better predictions… yes? Rating systems are often competing for small edges in prediction accuracy, as evidenced by this chart from the Whole History Rating paper


taken from KGS data no less

You'll notice that Glicko-1 correctly predicts around 0.4 percentage points more of the games than Elo does in the training data, which isn't much, but the added cost relative to Elo in terms of computation time isn't that much either (in terms of absolute time).

Is it significant? Maybe. It appears to be consistently enough better to be accepted as "more accurate", but I haven't seen any p-values thrown around.

1 Like

No, except for a few edge cases.
With the one-game-at-a-time rating update, the RD depends only on your previous RD and the rating difference between the two players.
The edge cases are players (bots) who mostly play against much stronger/weaker opponents.


With more games in a rating period, one could reduce the RD from 60 to 30. The rating change is quadratic in RD, so we would get a much more stable rating at the cost of slower adjustment to changes in player strength.
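As a rough sanity check of the "quadratic in RD" point, here is a sketch using the plain Glicko-1 single-game update for an even game between equally rated players (not the exact OGS code; the opponent RD of 60 is an assumption):

```python
import math

Q = math.log(10) / 400.0

def g(rd):
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def win_gain(my_rd, opp_rd=60.0):
    """Glicko-1 rating gain for winning one even game against an equally rated opponent."""
    E = 0.5  # expected score against an equally rated opponent
    d2 = 1.0 / (Q**2 * g(opp_rd)**2 * E * (1 - E))
    return (Q / (1.0 / my_rd**2 + 1.0 / d2)) * g(opp_rd) * (1.0 - E)

for rd in (30, 60, 120):
    print(f"RD {rd:>3}: a win gains ~{win_gain(rd):.1f} points")
# Halving RD from 60 to 30 cuts the per-game swing roughly by a factor of 4.
```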


By the way, to get rating periods with ~15 games, a rating period would need a length of between ~2 weeks and a month.

3 Likes

By the way, I never apologized to @Uberdude for getting annoyed at his contribution.

Re-reading his reply in hindsight, I realize I hadn't even understood what he said exactly. Though I'm still not sure exactly what he intended to do, I don't think I should have assumed a "malicious" intent, so I apologize.

1 Like

Uuuh, @meili_yinhua, you might be interested in this?

@flovo, do you know what’s happening here?

I know I’m coming into the middle of a long thread, and I haven’t read and fully comprehended everything that has been written so far (I have skimmed, but it’s long!), so I apologize if I repeat what’s already been stated. Regardless, I’ve spent some time thinking about these problems, and have some mathematical training, so I hope I can input something useful.

My preferred approach to design or evaluate a rating system would be as follows.

  1. Build (or choose, since this has been done in the papers) a probabilistic model that attempts to account for things like the probability of game outcomes given strength differences, the likelihood of wild swings in player strength, etc. This model would probably come with some tweakable global parameters.
  2. Evaluate a system (or a proposed system) in terms of the log-likelihood of the observed data (using the values of the global parameters that give the highest log-likelihood). There are at least two sources of likelihood here. First, the rating system assigns ratings to players that change over time, and the probabilistic model assigns some quantitative likelihood to these fluctuations. Second, there are game outcomes between players to whom the rating system has assigned a strength, and the probabilistic model assigns probabilities to those game outcomes.

The log-likelihood from step 2 is likely a negative number, and its negation (so, after multiplying by -1 to make it positive) measures information (in bits, say, if the log is base 2). We can think of this number as a quantitative measure of how much the rating system is "surprised" by the ratings and results data. (It is related to a lower bound on the number of bits needed to store the ratings-and-results data in a perfect data compression scheme, so it does actually measure some real-world quantity, but that's not very important here.)
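To make that concrete, here is a minimal sketch of such a scoring function; the Gaussian random-walk model for rating changes, the sigma value, and the toy history are all assumptions, not an existing OGS metric. Lower total surprise is better when comparing two candidate systems on the same game records.

```python
import math

def outcome_bits(p_win, won):
    """Bits of surprise for one game outcome, given the system's predicted win probability."""
    p = p_win if won else 1.0 - p_win
    return -math.log2(p)

def drift_bits(delta, sigma):
    """Negative log2-density of one rating change under an assumed N(0, sigma^2) drift model.
    This is a log-density, not a probability, so only differences between systems matter."""
    log_pdf = -0.5 * (delta / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    return -log_pdf / math.log(2.0)

# Toy history: (predicted win probability, did the player win, rating change applied afterwards)
history = [(0.55, True, 12.0), (0.48, False, -9.0), (0.62, True, 15.0), (0.51, False, -20.0)]
sigma = 25.0  # assumed expected rating drift per game, a tunable global parameter

total = sum(outcome_bits(p, w) + drift_bits(d, sigma) for p, w, d in history)
print(f"total surprise: {total:.1f} bits over {len(history)} games")
```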

I like this approach to thinking about the problem, because it lets us think about ratings volatility intuitively as well as quantitatively. For example, consider @Uberdude's hypothetical rating system that perfectly assigns either 30k or 9d to a player based on whether they will win their next game. If we have a system like that and want to store a record of the ratings-and-results data, we don't have to record the actual game results, only the ratings, because the results are a foregone conclusion implied by the ratings. There is one bit's worth of surprise per game, namely whether black or white was the 9d player for that game; everything else follows.

However, that's terribly inefficient: it's the same number of bits needed to store the results if we don't have any rating system at all and consider each game a toss-up: one bit per game, to record the winner. This matches the intuition that such a system, while it perfectly predicts the game outcomes, isn't a useful rating system.

A good rating system is a tradeoff, in this precise quantitative way: ratings that don’t change too much over time, so that ratings changes are predictable and therefore relatively “unsurprising” on average, and accurate game outcome predictions so that the game results themselves are unsurprising. It thus has the power to partially explain the observed results, which I think is part of what we appreciate in a good rating system (“Is this 4-win streak just a run of good luck, or have I gotten better?”)


TLDR: I vote that log-likelihood of observed results, in a model that accounts for both the likelihood of changes in strength and the probability of game outcomes, is the proper way to evaluate a rating system. My intuition is that OGS volatility is too high, if only because it doesn’t distinguish between games played back-to-back and games played after a long break, like a good model would. Consequently, I hypothesize that adjusting the system in a way that treats the volatility number as “ratings variance per wall-clock time” instead of “ratings variance per game played” would improve the situation at least a little bit.


PS: I also have a sneaking suspicion, though I haven’t yet looked into it in detail, that the way the volatility numbers are updated doesn’t make sense for the way we are using Glicko. That is, I think they are supposed to represent a statistical estimate of the player’s actual underlying strength volatility, but with a period of one game, it should not be possible to estimate this. In a real-world tournament where a player plays 15 games, say, you can use those games to get a very solid estimate of their current strength. You can look at how that strength has changed, or not, since their last rating, and compare that to the current estimate of their volatility. If there has been a big change, compared to what you’d expect from the volatility, you revise the estimate of the volatility upward, and vice versa. On the other hand, with a one-game period, consider a player that only ever plays evenly rated games. All the system knows is:

  1. Their estimated strength, which is the same as their opponent’s.
  2. The statistical confidence in that strength.
  3. The estimate of their underlying strength volatility.
  4. Whether they won the game.

You revise the strength upward or downward, based on whether they won or lost, by an amount that takes into account the statistical confidence in that strength. You increase the statistical confidence in the strength because you have another data point, and also decrease the statistical confidence in the strength because time has passed (taking into account the estimate of the underlying strength volatility).

If we assume the estimate of the volatility stabilizes at some point, the statistical confidence also stabilizes at the point where the increase from the volatility matches the decrease from the extra data point. So, you eventually wind up increasing or decreasing the strength estimate by a more-or-less fixed amount each game, where the size of the change basically reflects the estimate of the underlying strength volatility.

But the basis for estimating the underlying strength volatility in the first place could only possibly be the average amount by which the strength changes per game. It's circular; there is no good way to estimate the volatility given the data available to the rating system the way it's currently implemented.
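The "stabilizes where the increase from the volatility matches the decrease from the extra data point" part is easy to see numerically. Here is a sketch using plain Glicko-1 single-game updates against equally rated opponents; the per-game inflation constant c = 15 and the opponent RD of 60 are assumptions, not OGS's actual parameters.

```python
import math

Q = math.log(10) / 400.0

def g(rd):
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def one_even_game(rd, opp_rd=60.0, c=15.0):
    """Inflate RD for the passage of one game's worth of time, then shrink it
    with the information from one even game (Glicko-1 formulas)."""
    rd = min(math.sqrt(rd**2 + c**2), 350.0)          # volatility pushes RD up...
    E = 0.5
    d2 = 1.0 / (Q**2 * g(opp_rd)**2 * E * (1 - E))
    return math.sqrt(1.0 / (1.0 / rd**2 + 1.0 / d2))  # ...the new data point pulls it back down

rd = 350.0
for _ in range(200):
    rd = one_even_game(rd)
print(f"RD after 200 even games: {rd:.1f}")  # settles in the low 70s with these parameters
```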

This is possibly another reason for the perceived (measured?) volatility of OGS ratings: the volatility numbers may just be a mathematical fluke from the particular choices of parameters for the system, and they are just too high compared to real player strength volatility. A multi-game period system would estimate this and correct (per-player, even), but the OGS version can’t. If we don’t want multiple-game periods, it might be better to statistically estimate the real volatility using the available data, and peg player volatilities to that value instead of using the update formula. I think this is basically how Glicko-1 works.

2 Likes

I think this case is mostly about what is displayed. I believe OGS displays (in some situations) its estimate for how weak a player might be (“humble rank”; rating - 2 * rating deviation), instead of its actual best guess (the rating). The actual rating for a new player is around 6k. When they play and lose to 1d players, it can’t actually lower their rank that much, because a 6k player is expected to lose to dan players! However, just by having more data points, the rating deviation decreases. This is indeed a statistical fluke—the “real” confidence interval for the player should be two-sided, with one deviation for how much stronger they might be than their rating, and another deviation for how much weaker they might be than their rating. In the simplified model of the Glicko(-2) system, there is just one number, though. Having lost to 1d players, they’re probably not that much stronger than 6k, so their rating deviation comes down. Because of the simplifying assumption that this one deviation represents how much stronger or weaker they might be than their rating, the system’s estimate for how weak they might be rises, and this is what OGS displays sometimes. Again, the system’s actual estimate for their strength is decreasing.
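A tiny numerical illustration of that effect, with entirely hypothetical numbers, and taking "rating − 2 × rating deviation" as the humble display exactly as described above:

```python
def humble(rating, deviation):
    """Displayed lower bound: best-guess rating minus two deviations."""
    return rating - 2 * deviation

before = (1500, 350)  # hypothetical fresh account: very uncertain, so the humble display is low
after  = (1400, 200)  # after losing to dan players: the rating fell a bit, but the deviation fell more

print("humble display before:", humble(*before))  # 1500 - 700 = 800
print("humble display after: ", humble(*after))   # 1400 - 400 = 1000 -> displayed value went UP
```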

This situation will correct itself, with their rating falling rapidly, if they start losing games to weaker people. Remember, if all we know about someone is that they’ve lost 20 even games to 1d players, they may very well actually be 6k.

2 Likes