Is rating volatility a bug or a feature? [Forked]

I don’t quite understand Glicko-2, but my impression is that the deviation part is a small modification to a system that mostly works without it. And a small modification that is almost the same for everyone is going to be doubly insignificant.

Looking at the equations, there’s this key factor g:

g(RD) = 1 / sqrt(1 + 3 q² RD² / π²),   with q = ln(10) / 400

So my g is 0.980 and Sofia’s is 0.973. Is that really a useful distinction to make between us?

If I’m doing this right, 350 for a new player gives a g of 0.667. Is that small enough to make a significant difference to anything in those first few games before it settles down to ~65 for everyone? Does anyone ever get a high deviation after the first few games?
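
For anyone who wants to play with the numbers, here's a quick sketch of that g factor in Python. The RDs of roughly 65 and 75 are my guesses for the deviations behind the 0.980 and 0.973 figures above, not values I've confirmed:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Weighting factor g(RD) from the Glicko formulas."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

for rd in (65, 75, 350):
    print(rd, round(g(rd), 3))
# roughly: 65 -> 0.979, 75 -> 0.973, 350 -> 0.669
```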

…If it produces better predictions… yes? Rating systems are often competing for small edges in prediction accuracy, as evidenced by this chart from the Whole History Rating paper:


[chart: prediction-accuracy comparison from the WHR paper, computed on KGS data no less]

You'll notice that Glicko-1 correctly predicts around 0.4 percentage points more games than Elo on the training data, which isn't much, but the added cost relative to Elo in terms of time isn't much either (in absolute terms).

Is it significant? Maybe. It appears to be consistently enough better to be accepted as “more accurate”, but I haven’t seen any p-values thrown around.

1 Like

No, except for a few edge cases.
With the one-game-at-a-time rating update, the RD depends only on your previous RD and the rating difference between the two players.
The edge cases are players (bots) who mostly play against much stronger or weaker opponents.


With more games in a rating period, one could reduce the RD from 60 to 30. Rating change is quadratic in RD, so we would get a far more stable rating, at the cost of slower adjustments to changes in player strength.
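
To put a rough number on “quadratic in RD” (this is my own arithmetic, using the single-game Glicko-1 update as an approximation, not anything taken from the OGS code): the per-game rating change is roughly Δr ≈ q · g(RD_opp) · RD² · (s − E), with q = ln(10)/400. For an even game (s − E = ±0.5), an RD of 60 moves the rating by about 10 points, while an RD of 30 moves it by about 2.5, i.e. roughly a quarter as much.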


By the way, to get rating periods with ~15 games, a rating period would need a length of between ~2 weeks and a month.

3 Likes

By the way, I never apologized to @Uberdude for getting annoyed at his contribution.

Re-reading his reply in hindsight, I realize I hadn’t even understood exactly what he had said. Though I’m still not sure exactly what he intended to do, I don’t think I should have assumed “malicious” intent, so I apologize.

1 Like

Uuuh, @meili_yinhua, you might be interested in this?

@flovo, do you know what’s happening here?

I know I’m coming into the middle of a long thread, and I haven’t read and fully comprehended everything that has been written so far (I have skimmed, but it’s long!), so I apologize if I repeat what’s already been stated. Regardless, I’ve spent some time thinking about these problems, and have some mathematical training, so I hope I can input something useful.

My preferred approach to design or evaluate a rating system would be as follows.

  1. Build (or choose, since this has been done in the papers) a probabilistic model that attempts to account for things like the probability of game outcomes given strength differences, the likelihood of wild swings in player strength, etc. This model would probably come with some tweakable global parameters.
  2. Evaluate a system (or a proposed system) in terms of the log-likelihood of the observed data (using the values of the global parameters that give the highest log-likelihood). There are at least two sources of likelihood here. First, the rating system assigns ratings to players that change over time, and the probabilistic model assigns some quantitative likelihood to these fluctuations. Second, there are game outcomes between players to whom the rating system has assigned a strength, and the probabilistic model assigns probabilities to those game outcomes.
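
A minimal sketch of what step 2 could look like in Python, under two modelling assumptions that are mine for illustration (a logistic, Elo-style win probability and a Gaussian random walk for strength changes); every name and parameter here is made up, not an existing API:

```python
import math

def elo_win_prob(r_winner, r_loser):
    """Win probability for the first player under a logistic (Elo-style) curve."""
    return 1 / (1 + 10 ** ((r_loser - r_winner) / 400))

def log_likelihood(games, rating_histories, drift_sigma):
    """Score a rating system on observed data (sketch only).

    games            -- list of (winner_rating, loser_rating) pairs, using the
                        ratings the system had assigned at game time
    rating_histories -- per player, the sequence of ratings the system produced
    drift_sigma      -- the model's assumed per-step strength drift
    """
    ll = 0.0
    # 1) How surprising are the game results, given the assigned ratings?
    for winner_rating, loser_rating in games:
        ll += math.log(elo_win_prob(winner_rating, loser_rating))
    # 2) How surprising are the rating fluctuations themselves, under a simple
    #    Gaussian random-walk model of strength changes?
    for history in rating_histories:
        for prev, curr in zip(history, history[1:]):
            z = (curr - prev) / drift_sigma
            ll += -0.5 * z * z - math.log(drift_sigma * math.sqrt(2 * math.pi))
    return ll  # closer to 0 is better; -ll / math.log(2) is "bits of surprise"
```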

The log-likelihood from step 2 is likely a negative number, and its negation (after multiplying by -1 to make it positive) measures information (in bits, say, if the log is base 2). We can think of this number as a quantitative measure of how much the rating system is “surprised” by the ratings and results data. (It is related to a lower bound on the number of bits needed to store the ratings-and-results data in a perfect data compression scheme, so it does actually measure some real-world quantity, but that’s not very important here.)

I like this approach to thinking about the problem, because it lets us think about ratings volatility intuitively as well as quantitatively. For example, consider @Uberdude’s hypothetical rating system that perfectly assigns either 30k or 9d to a player based on whether they will win their next game. If we have a system like that and want to store a record of the ratings-and-results data, we don’t have to record the actual game results, only the ratings, because the results are a foregone conclusion and implied by the ratings. There is one bit’s worth of surprise per game, which is whether black or white was the 9d player for that game; everything else follows.

However, that’s terribly inefficient: it’s the same number of bits needed to store the results if we had no rating system at all and considered each game a toss-up: one bit per game, to record the winner. This matches the intuition that such a system, while it perfectly predicts the game outcomes, isn’t a useful rating system.

A good rating system is a tradeoff, in this precise quantitative way: ratings that don’t change too much over time, so that rating changes are predictable and therefore relatively “unsurprising” on average, and accurate game-outcome predictions, so that the game results themselves are unsurprising. It thus has the power to partially explain the observed results, which I think is part of what we appreciate in a good rating system (“Is this 4-win streak just a run of good luck, or have I gotten better?”).


TLDR: I vote that log-likelihood of observed results, in a model that accounts for both the likelihood of changes in strength and the probability of game outcomes, is the proper way to evaluate a rating system. My intuition is that OGS volatility is too high, if only because it doesn’t distinguish between games played back-to-back and games played after a long break, like a good model would. Consequently, I hypothesize that adjusting the system in a way that treats the volatility number as “ratings variance per wall-clock time” instead of “ratings variance per game played” would improve the situation at least a little bit.
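
To make that last proposal concrete, here is a rough sketch of the kind of change I mean. The function name, the `days_per_period` parameter and the whole framing are mine, not anything from the OGS code; only the pre-game step φ* = sqrt(φ² + σ²) is from the Glicko-2 description:

```python
import math

def inflate_deviation(phi, sigma, days_since_last_game, days_per_period=1.0):
    """Pre-game deviation inflation scaled by elapsed wall-clock time,
    instead of the fixed 'one game = one period' step.

    phi   -- the player's current rating deviation (Glicko-2 scale)
    sigma -- the player's volatility
    """
    periods_elapsed = days_since_last_game / days_per_period
    return math.sqrt(phi ** 2 + (sigma ** 2) * periods_elapsed)
```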


PS: I also have a sneaking suspicion, though I haven’t yet looked into it in detail, that the way the volatility numbers are updated doesn’t make sense for the way we are using Glicko. That is, I think they are supposed to represent a statistical estimate of the player’s actual underlying strength volatility, but with a period of one game, it should not be possible to estimate this.

In a real-world tournament where a player plays 15 games, say, you can use those games to get a very solid estimate of their current strength. You can look at how that strength has changed, or not, since their last rating, and compare that to the current estimate of their volatility. If there has been a big change, compared to what you’d expect from the volatility, you revise the estimate of the volatility upward, and vice versa.

On the other hand, with a one-game period, consider a player that only ever plays evenly rated games. All the system knows is:

  1. Their estimated strength, which is the same as their opponent’s.
  2. The statistical confidence in that strength.
  3. The estimate of their underlying strength volatility.
  4. Whether they won the game.

You revise the strength upward or downward, based on whether they won or lost, by an amount that takes into account the statistical confidence in that strength. You increase the statistical confidence in the strength because you have another data point, and also decrease it because time has passed (taking into account the estimate of the underlying strength volatility).

If we assume the estimate of the volatility stabilizes at some point, the statistical confidence also stabilizes, at the point where the increase from the volatility matches the decrease from the extra data point. So you eventually wind up increasing or decreasing the strength estimate by a more-or-less fixed amount each game, where the size of the change basically reflects the estimate of the underlying strength volatility.

But the basis for estimating the underlying strength volatility in the first place could only possibly be the average amount by which the strength changes per game. It’s circular; there is no good way to estimate the volatility given the data available to the rating system the way it’s currently implemented.
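
To illustrate the feedback loop I’m describing, here is a toy simulation. It uses the simpler Glicko-1 update rather than Glicko-2, with a fixed per-game deviation inflation `c` standing in for the volatility term, so the specific numbers are made up; only the qualitative behaviour is the point:

```python
import math

Q = math.log(10) / 400

def g(rd):
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def one_even_game(rating, rd, won, opp_rd=60.0, c=10.0):
    """One Glicko-1-style update for an evenly rated game, preceded by a
    fixed RD inflation c (a stand-in for the volatility term)."""
    rd = min(math.sqrt(rd ** 2 + c ** 2), 350.0)    # time passes: confidence decays
    e = 0.5                                         # even game: expected score 1/2
    d2 = 1 / (Q ** 2 * g(opp_rd) ** 2 * e * (1 - e))
    new_rd = math.sqrt(1 / (1 / rd ** 2 + 1 / d2))  # new data point: confidence grows
    delta = Q * new_rd ** 2 * g(opp_rd) * ((1.0 if won else 0.0) - e)
    return rating + delta, new_rd

rating, rd = 1500.0, 350.0
for i in range(40):
    rating, rd = one_even_game(rating, rd, won=(i % 2 == 0))
    print(round(rating), round(rd))
# With these toy numbers, rd settles around 60 and the rating then just
# oscillates by roughly 10 points per game: the "fixed amount" described above.
```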

This is possibly another reason for the perceived (measured?) volatility of OGS ratings: the volatility numbers may just be a mathematical fluke from the particular choices of parameters for the system, and they are just too high compared to real player strength volatility. A multi-game period system would estimate this and correct (per-player, even), but the OGS version can’t. If we don’t want multiple-game periods, it might be better to statistically estimate the real volatility using the available data, and peg player volatilities to that value instead of using the update formula. I think this is basically how Glicko-1 works.

2 Likes

I think this case is mostly about what is displayed. I believe OGS displays (in some situations) its estimate for how weak a player might be (“humble rank”; rating - 2 * rating deviation), instead of its actual best guess (the rating).

The actual rating for a new player is around 6k. When they play and lose to 1d players, it can’t actually lower their rank that much, because a 6k player is expected to lose to dan players! However, just by having more data points, the rating deviation decreases.

This is indeed a statistical fluke: the “real” confidence interval for the player should be two-sided, with one deviation for how much stronger they might be than their rating, and another deviation for how much weaker they might be than their rating. In the simplified model of the Glicko(-2) system, there is just one number, though. Having lost to 1d players, they’re probably not that much stronger than 6k, so their rating deviation comes down. Because of the simplifying assumption that this one deviation represents how much stronger or weaker they might be than their rating, the system’s estimate for how weak they might be rises, and this is what OGS displays sometimes. Again, the system’s actual estimate for their strength is decreasing.

This situation will correct itself, with their rating falling rapidly, if they start losing games to weaker people. Remember, if all we know about someone is that they’ve lost 20 even games to 1d players, they may very well actually be 6k.

2 Likes

Holy, you’re right :neutral_face: I hadn’t realized the confidence interval was one-sided; I thought it was just a graphical glitch! (Most likely because, as you say, there’s a “true rank” at 6 kyu at the center of the interval.)

(I’ll read your other message when I have more time; in the meantime, thanks for your contribution and for stating outright that you hadn’t read everything, I appreciate it :slight_smile:)

1 Like

This is a great explanation and the first time I’ve actually understood this “humble rank” concept. (But I think there’s no factor of two.) I also dug up what looks like the original proposal explaining what it’s supposed to be:

So I guess all the matchmaking parts are broken now, leaving just the display, which is only in one or two places anyway.

1 Like

Yeah, humble rank is an arcane bit of knowledge that I don’t pay much attention to but have to explain every now and then. It’s a weird middle-of-the-road solution, born of some very strongly held opinions about how Glicko is “supposed” to have a certain middle-of-the-road insertion position (when really that is just the result of assuming we have no information for a better-informed “initial” ratings distribution).

4 Likes

IIRC the question of humble rank came up in the 2021 discussions, and it was said that humble rank was broken. I believe that GreenAsJade/Eugene, who introduced humble rank to OGS, agreed that it was broken. What, if anything, has happened to it since that time, I don’t know.

1 Like

I was just now trying to track down my “factor of two” confusion, and now I’m confused again about humble rank. I went to the page of a player who has played no rated games, and their rank in the display area is “1150 ± 350” / “11.9k ± 4.9”. However, in the list of games, it displays as “6k”.

I had thought (and said, above) that 6k was the center estimate. Is it actually 12k? If so, what is the 6k in the games list? It doesn’t seem to be 11.9 - 4.9 = 7k plus some rounding error. For one, that would mean it’s displaying some kind of “boastful rank,” rather than humble rank, in the games list, which seems weird. Also, my rank is “9.0k ± 1.1”, which would place my “boastful rank” at something like 7.9k, but it displays in the same game lists as 9.0k [correction: 9k].

So it does seem like there is some weird display bug somewhere. My explanation for @espoojaram’s recent example only makes sense if the 11.9k was the humble rank but, if so, why does it display as “11.9k ± 4.9” on the profile page? That clearly seems to indicate 11.9k as the central estimate, not the humble rank. But even then, if it is displaying “[rating - deviation] ± [deviation]” on the profile and “[rating]” on the games list, then why does my rank display as “9.0k ± 1.1” on my profile and “9k” in the games list? I’m pretty sure it wasn’t until my actual rank was “9.0” that it displayed that way, either, having rounded up to “10k” even when I was 9.2 on the profile.

I’m now much more confused than I was yesterday.

[Regarding the “factor of two” confusion: that came from the Glicko-2 example PDF, which suggests displaying an approximate 95% confidence interval as ± 2 × RD. I had assumed that was what OGS implemented.]

[Retracted; the relevant discussion is in another thread.] Question: without getting into display bugs, does anyone here know, outright, whether the initial rating (the actual rating, for the Glicko algorithm) we use here is 1150, or is it 1500?

2 Likes

We just had a discussion about this, starting here: Are OGS rankings inflated, deflated or neither? - #13 by Conrad_Melville

1 Like

This is the problem with all this stuff being split between two different threads.

1 Like

To answer your question, I no longer know. It looks like we are back to 11.9k as the starting pseudo-rank, but it is unclear whether the 6k on the thumbnails is a “leftover” (therefore a bug) or if it is used for matching.

FYI, all ranks are rounded up for anything over X.0

1 Like

Hmmm. My conjecture/interpretation, when you pointed that out, was that the rating plotted out in the profile page is a weighted average, or interpolation, of the “low boundary of the 1σ-confidence interval” (the “humble rank”) and the current Glicko-2 rating estimate, and basically moves from 100% humble rank at the beginning to 100% Glicko-2, as a function of the deviation itself.

Since ratings are considered “accurate” when the deviation reaches 160, I would guess the parameter for this interpolation slides as a function of the interval [350,160].

Something like this:
displayed = t × humble + (1 − t) × rating,   where humble = rating − deviation and t = (deviation − 160) / (350 − 160)
(sigh, I wish this platform allowed for math equations)

(EDIT: I realized the formula simplifies to this :sweat_smile:)
displayed = rating − deviation × (deviation − 160) / 190

In the above formula I’m assuming that the Deviation starts out at 350 like it’s displayed in the graphs.
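
As a quick sanity check of that conjecture at the endpoints (using my reconstruction of the formula above): at a deviation of 350 the interpolation weight is 1, so the plotted value would be the pure humble rank, 1500 − 350 = 1150; at a deviation of 160 the weight is 0, so it would be the pure Glicko-2 rating of 1500; and at a deviation of 255 it would sit halfway, 1500 − 255/2 ≈ 1372.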

But even if that formula is right, it doesn’t answer why the deviation is displayed as 4.9. (That it’s expressed as “11.9 ± 4.9” doesn’t surprise me; after all, that’s what a lot of scientific research does too: write down a simplified symmetrical uncertainty even though the uncertainty is usually not symmetrical, and the quantity we’re talking about here is essentially entirely made up.)

I’d say that the true provisional rating being 1150 wouldn’t make sense given the example I brought up above (losing three games and gaining rating each time).

Also, note that 1150 = 1500 − 350, whereas 11.9 ≠ 6 + 4.9.

My guess is that the deviation is being converted into a “rank-like” quantity through the use of the usual formula, which is log-based, so it ends up doing this weird effect of shrinking the deviation.
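
If I’m remembering the conversion right, that guess checks out numerically. The constants 525 and 23.15 below are from my recollection of the goratings documentation, so treat them as an assumption, but they do reproduce both the ~6k and the 11.9k figures:

```python
import math

def rating_to_rank(rating):
    """Log-based rating-to-rank conversion (constants are my recollection of
    the goratings docs, not verified here); a rank of 30.0 corresponds to 1 dan."""
    return math.log(rating / 525) * 23.15

def to_kyu(rank):
    return 30 - rank

print(to_kyu(rating_to_rank(1500)))        # ~5.7, i.e. the "6k" centre estimate
print(to_kyu(rating_to_rank(1500 - 350)))  # ~11.9, i.e. the humble-rank display
# Because the conversion is logarithmic, a 350-point deviation is not a fixed
# number of rank units: it spans ~6.2 ranks below 1500 but only ~4.9 above it.
print(rating_to_rank(1500) - rating_to_rank(1150))  # ~6.2
print(rating_to_rank(1850) - rating_to_rank(1500))  # ~4.9
```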


Well, this is a very technical thread, and the main topic of this thread is actually to devise a test to verify once and for all whether the current rating volatility is hurting the handicap system and the matchmaking.

2 Likes

I plan to start looking into the ratings issues in more detail. To bump this earlier question, because I don’t see it answered: do we know whether the Python code found in the goratings GitHub repository is the actual production code (and therefore definitely reflects what OGS actually does), or is it exclusively algorithm-development code?

The README on GitHub leaves it ambiguous, in my reading.

1 Like

While that question went unanswered here, this reply above from a co-developer seems to imply that it is.


Oh, I agree :laughing:

1 Like

@paisley:
I’ve finally gotten around to reading your first reply to this thread :laughing:

Given how long it appeared to be, it was a surprisingly painless experience; even for a quasi-layman like me, it was pretty understandable.

Well, I don’t know what log-likelihood is, but I’ll take you at your word that it might be a good measure for evaluating a rating system.

Now, I think the best way to perform the test is to implement all of the decent proposed methodologies and get many different kinds of evaluation. If they are good measures and there’s a clear-cut answer to be found, they should mostly agree, and if they don’t agree, that will tell us something weird is going on. And anyway, the decision will be left to the readers/peer-reviewers as to whether any of the systems are good.

So what I mean is: for what it’s worth, if it comes down to me I’ll do my best to understand the method you proposed thoroughly and implement it. (Hopefully people more proficient than me can collaborate though :laughing:)

I believe the point you bring up about the volatility (that it should be impossible to update it with one-game rating periods) was vaguely referenced in this reply and then again in this reply; not much, but it might help you. It would also have been impossible to get specific without seeing the code, which you seem to be working on (assuming that the goratings repository is the code used right now).

Before reading it, I was actually planning on giving you some advice on how to format and lay out a long message to make it more readable, but to be honest the vibe I’m getting in the last few days is that I have alienated most of this forum’s community with my long replies, so I guess my advice is: don’t write long replies? :laughing:

You haven’t alienated me. I enjoy reading your posts, but I didn’t spend much time on my computer or phone during the past week with Christmas dining and some outings (enjoying a week off).

3 Likes