Is rating volatility a bug or a feature? [Forked]

It’s a standard bias-variance tradeoff.
If the high-resolution rating happens not to be overfit and makes good predictions across the overall set of individual games, then it could be used to create more even matches, since it would be accurately predicting the user’s current skill on a per-game basis.

I think it will be very difficult to accurately model individual players to reliably predict their current skill (beyond a simple Elo winrate prediction). How many per-player parameters would such a model need?

One might need to train a hefty AI to make personalised short-term skill predictions that are more reliable than your run-of-the-mill Elo-like rating system.

I mean yes, and Glicko-2 uses a modified Elo prediction. I’m just making no assumptions about what the data would tell me if I were to run tests on it, even though, qualitatively, the high responsiveness seems suspicious.

In the Elo rating system, there is a responsiveness parameter K, but it is the same for all players at a given rating, so not one variable per individual player.
K tends to be smaller for higher ratings and larger for lower ratings, so high ratings respond more slowly to game results and low ratings more quickly. K is essentially the maximum rating change from a single game.
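For concreteness, a minimal sketch of an Elo update with a rating-dependent K (the thresholds and K values below are purely illustrative, not any server’s actual schedule):

```python
def expected_score(r_player, r_opponent):
    # Standard Elo win expectancy on the 400-point logistic scale
    return 1 / (1 + 10 ** ((r_opponent - r_player) / 400))

def k_factor(rating):
    # Illustrative schedule: higher ratings get a smaller K,
    # so they respond more slowly to individual results
    if rating >= 2400:
        return 10
    if rating >= 2000:
        return 20
    return 32

def elo_update(r_player, r_opponent, score):
    # score: 1 for a win, 0.5 for a draw, 0 for a loss;
    # the rating never moves by more than K in a single game
    k = k_factor(r_player)
    return r_player + k * (score - expected_score(r_player, r_opponent))
```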

The EGF rating system also has such a thing, but it is called con (for historical reasons). It also depends on rating only.

As far as I know, Glicko-2 has a similar thing, but it is tracked for each player individually. I assume there can still be a universal factor to scale the overall rating responsiveness, but I don’t really know.

There are several: the initial rating deviation, the initial volatility, and a special constant tau that controls how quickly the volatility adapts to an individual player.

In Glicko-1 this was clearer, since the volatility was essentially the population-wide constant c.
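For reference, Glicko-1 applies that constant at the start of each rating period: a player’s RD is inflated for the time they have been inactive, capped at 350 (the RD of an unrated player),

$$RD' = \min\left(\sqrt{RD^2 + c^2\, t},\ 350\right)$$

where $t$ is the number of rating periods since the player last competed.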

The bias-variance tradeoff is also affected by the choice of rating periods: rating periods spanning shorter time frames (and therefore fewer games) give higher variance, while longer time frames (and therefore more games) give higher bias.

As I understand it, OGS has followed the Lichess route of instant-time rating periods: the rating is recalculated upon each game, with special calculations for how volatility affects RD, because updating the RD at every time-step on the server would be impractical.
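A sketch of that lazy approach, reusing the Glicko-1 formula above (Glicko-2 does the analogous inflation with the player’s own volatility in place of a global c):

```python
import math

def rd_at_game_time(rd, c, periods_elapsed, max_rd=350.0):
    # Instead of inflating every player's RD at each time-step,
    # inflate it only when the player actually plays, using the
    # time elapsed since their previous game
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_elapsed), max_rd)
```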

This instant-period approach necessarily creates a relatively high-variance, low-bias result, and the tests done on it to assess its accuracy supposedly give a reasonably good basis for tuning parameters (although, once again, there doesn’t seem to be any use of the more standard tests, or even a separation between training data and testing data for the model; the latter omission tends to suggest a high likelihood of choosing a version of the model that overfits the training data).
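To illustrate the kind of train/test separation being alluded to, a sketch (load_games, replay, and candidate_parameter_sets are hypothetical placeholders: a loader of date-sorted game records, a function that replays them under a given parameter set and returns its win predictions alongside the actual outcomes, and a grid of values for tau, initial RD, etc.):

```python
import math

def log_loss(predictions, outcomes):
    # Mean negative log-likelihood of the predicted win probabilities;
    # lower is better, and confident wrong predictions are punished hard
    eps = 1e-12
    return -sum(o * math.log(p + eps) + (1 - o) * math.log(1 - p + eps)
                for p, o in zip(predictions, outcomes)) / len(predictions)

# Chronological split: tune parameters on the older games only, then
# score the winning configuration once on the held-out newer games,
# so a version that overfits the past shows up as a poor test score
games = load_games()                   # hypothetical, sorted by date
cut = int(0.8 * len(games))
train, test = games[:cut], games[cut:]

best = min(candidate_parameter_sets,   # hypothetical parameter grid
           key=lambda params: log_loss(*replay(train, params)))
print("held-out log loss:", log_loss(*replay(test, best)))
```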

So while it is understandable to be suspicious of the highly responsive results, it’s still up to proper testing to reach a conclusion.

1 Like

Imagine a rating system powered by a pre-cog (from Minority Report) which perfectly predicts whether you will win or lose your next game. It thus flips your rank between 30k and 9d and has 100% accuracy in predicting your results. However, I think we would all agree this level of volatility is useless, and a bug rather than a feature.

Reading this discussion, and reflecting on the previous answers I got from meili, I’ve had another galaxy brain idea. I’ll add it to the bucket of “rating systems to test if I ever manage to implement the methodology I described earlier, that nobody seems to care about so I’ll take their silence as tacit agreement on its validity”.

Of course we still need to test if the high-frequency oscillation makes for better win% predictions, but suppose we come to that conclusion, and the system is actually able to catch on to players temporarily playing better and stuff.

One problem is that this probably compounds with what meili talked about here:

 I don’t necessarily know if they meant this, but one example I can think of is a player who plays a lot every day, then goes to bed, and comes back 6 to 16 hours later, say. Unless there’s already a calculation to prevent this, the system’s prediction is now “stuck” at the level they were playing at the end of the previous day, and if their skill is now swinging differently, the system will take a few games to catch up.

 I guess worst case scenario, the player can end a day on a high note, come back the next day on a low note, start playing and warming up game after game, while the system is catching up to the fact that they started on a low note. In this example, for a bunch of games at the end of the second day, the rating system is very wrong and has very bad win% predictions. This might even happen multiple times during the day. A player is playing badly, they take a pause, they come back renewed, and the system has to play catch up.

I think this is much, much worse in the case of correspondence games, by the way, because the player’s ability fluctuates over the course of the game itself, and their skill at the moment the game ends is not necessarily well represented by the skill they exhibited in the crucial moments of the game. I think this is another thing we need to test and consider: whether it’s better to have a different rating system for correspondence games (especially for players who play a lot of them simultaneously), rather than just different ratings with the same system. (By the way, I’m formatting this differently to help the reader’s eye parse this long reply; I still need to find a good way to do it.)

 Back to the other problem: I had the idea that a solution might be to keep both a “low-frequency rating” and a “high-frequency rating”, with a period of “decay” after each game, so that if it’s been a while since the last game, the system uses an average of the two ratings, weighted as a function of how long it’s been (relative to the player’s previous frequency of play, maybe). The uncertainty of this compound rating should also be adjusted as a function of that interval of time.
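A minimal sketch of that decay idea (the exponential weighting and the 12-hour time constant are my own assumptions, just to make the shape concrete):

```python
import math

def blended_rating(high_freq, low_freq, hours_since_last_game,
                   decay_hours=12.0):
    # Weight shifts from the responsive rating toward the stable one
    # the longer the player has been away; decay_hours is a free
    # parameter that would itself need tuning (possibly per player,
    # relative to their usual frequency of play)
    w = math.exp(-hours_since_last_game / decay_hours)
    return w * high_freq + (1 - w) * low_freq

# Right after a game the prediction is mostly the high-frequency
# rating; after a long break it falls back to the low-frequency one
print(blended_rating(1650, 1500, hours_since_last_game=1))   # ~1638
print(blended_rating(1650, 1500, hours_since_last_game=48))  # ~1503
```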

 In terms of user interface, ideally it should be as transparent as possible without being too confusing, but I guess those are empty words. Personally, as I’ve essentially said many times, I think it would be better to display mostly just a rank calculated from the low-frequency rating, because (well, this is my hypothesis) it’s more likely to be representative of the player’s knowledge at any moment, and to match the cultural expectations attached to the rank.

 When an auto-handicap game happens, there should be a clearly visible explanation of how it was calculated, explicitly saying e.g. “our testing indicated that this system is most likely to predict the appropriate handicap”. Perhaps, only for auto-handicap games, we could display the player’s rank as e.g. “10k(+3)” with a question mark next to it; hovering on the question mark brings up a panel with an explanation.

 I don’t know if there are other situations where the rating being different from the displayed rank would be confusing. The only thing is that, for transparency’s sake, the high-frequency rating, and/or whichever weighted average is currently relevant, should be optionally visible in the rating graph.

If this were possible, players might want to add automatch filters to select opponents who are likely to be currently tilting from an upset loss, or opponents who seem to be intoxicated or distracted somehow.

Edit: added the suppose clause

1 Like

Nope, we don’t all agree. As I pointed out multiple times (and I don’t know if I have the right to get annoyed at people coming in without reading the topic, but, well, it’s starting to happen), and as evidenced by this disagreement, the purpose of a rating system is subjective; but many of us (or at least many of the people who have written in this topic) think it should produce the best matchmaking and handicap prediction possible.

I also said I agree that the displayed rank should serve to match the cultural expectations attached to the traditional ranking system as well as possible, and I have proposed ways to possibly succeed at doing both things.


Hmm, is adding automatch filters even possible? Even if one is very good with browser extensions and coding, I’d guess the automatch is calculated on the server side and is inaccessible to the user.

A problem could be that it could add the temptation to, say, cancel matches, or manually challenge such players.

Hmm, yeah, this could be used to farm the low-frequency rating to a limited degree (edit: in the case that they were implemented as completely separate rating systems). Unless, wait a minute, the low-frequency rating used the opponent’s compound-frequency rating for its update (which is accurate, since in the hypothetical scenario it’s the best measure of the level at which they played that match). I think that strategy would then be pointless.

[EDIT:] Oh, actually, I need to point out that it would all depend on how exactly these two ratings are implemented. The “low-frequency rating” system might even just be the notorious “averaged out version” of the high-frequency rating, in which case I don’t think farming it would be possible, for the same reason as above. [end of edit]

Unless you meant some other purpose other than rating farming?

I know espo has tried to counter this point, but the real issue with this prediction system is that it supposedly only predicts the outcome of the game you are about to play; it doesn’t necessarily predict the outcome against every possible player and use that to choose the most interesting match from the field.

3 Likes

Not sure why you were annoyed by that comment. It seems you’re aiming at a rating system that can predict individual game results much more accurately than the relatively simple systems that are currently in use, by factoring in additional personalized time-based variables.
I read @Uberdude’s comment as sketching the best possible rating system by those criteria: a (hypothetical) clairvoyant rating system that knows ahead of time that a package deliverer will come to your door while you are in your last byoyomi period. With such knowledge, the rating system could even decide not to award your opponent many rating points, because it knows about the interference with your normal ability.

I was just slightly annoyed at the “we would all agree” part.

Also, I might have been wrong, but I was reading between the lines that, in the context of this topic, Uberdude’s comment seemed to constitute an example of the “slippery slope” rhetorical technique, where you say something that technically can’t be refuted because it’s an extreme hypothetical scenario, but then imply that your conclusion also applies to the real or realistic scenario.

I wouldn’t classify it along those lines, because I do believe a rating system (or, more accurately, a matchmaking system) that can predict the outcome of any possible game with perfect precision would be useful, provided it doesn’t hinge its expectations purely on the game that the matchmaking system is going to pick for the player.

The only issue is that with perfect 1-and-0 probabilities via precognition, it’s not clear how it would pick “interesting” games, beyond matching players who would beat similar numbers of the other players in the pool, were those games actually played.
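Concretely, with a perfect 0/1 predictor the win probability carries no “evenness” signal anymore, so about the only rank-like quantity left is a Copeland-style count of pairwise victories (would_win standing in for the hypothetical precog oracle):

```python
def beat_counts(players, would_win):
    # would_win(a, b) -> True iff the precog says a beats b;
    # count each player's pairwise victories as a rank proxy
    return {p: sum(would_win(p, q) for q in players if q != p)
            for p in players}

# Matchmaking would then pair players with similar counts, since a
# 50% predicted win probability no longer exists as a target
```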

1 Like

Ah, yes, I agree, I forgot to add it to my reply to gennan.

EDIT: To be clear, I mean that, based on Uberdude saying this:

then gennan seems to be implying here:

that this feature would be undesirable, and I would disagree, for the same principles meili pointed out. EDIT for clarity: If the system performs better matchmaking, anything goes, in theory.

A lot is possible. A change to the rating system requires code changes on the server anyway, and I can’t think of any reason why it would be impossible to extend automatch filters by adding parameters like these.

Yes, adding complexity to the rating system, turning it more into a black box with unpredictable side effects, could give sandbaggers and other rating manipulators more tools to game it. Already with Glicko-2, some people have successfully exploited deviation and volatility to boost their rating on a different game server:
https://www.reddit.com/r/TheSilphRoad/comments/hwff2d/farming_volatility_how_a_major_flaw_in_a/

1 Like

The question of which opponent would give the most interesting game is more difficult than the question of which opponent would give a winning probability as close to 50% as possible, according to the prediction of the rating system. It adds the question of what is interesting to both players, which is subjective and may differ between them.
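In code, the 50% half of this is the easy part (using a plain Elo expectancy as a stand-in for whatever predictor the rating system provides); it’s the “interesting” half that has no obvious formula:

```python
def win_probability(r_a, r_b):
    # Elo-style expectancy, a placeholder for the system's predictor
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def most_even_opponent(player_rating, candidate_ratings):
    # Pick the candidate whose predicted game is closest to a coin flip
    return min(candidate_ratings,
               key=lambda r: abs(win_probability(player_rating, r) - 0.5))

print(most_even_opponent(1500, [1320, 1480, 1555, 1700]))  # -> 1480
```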

I guess the main issue with any kind of rating system is that people desire different things from it.

New players / beginners usually want a rating system that adjusts quickly to their development on a game-by-game basis, while more “established” players prefer more stable ranks, which they can use as part of their identity.

Also, some people hope that their online rank will closely match their so-called “real” offline AGA/EGF rank, while others only play online and don’t care at all how ranks here compare to other rating systems.

And of course some people play tens of games a day while others play just a few in any given month, but they may (or may not) still share the same expectations about how volatile their rank should be.

(Oh, and besides those, there are always some sandbaggers who prefer having a lower-than-it-should-be rank in order to play against players they can beat with ease…)

All rating systems are compromises. I honestly like the current OGS implementation a lot better than the older pre-2017 Elo-based ranks or even the pre-2021 ranks, but I belong to the “stagnated old farts” group who haven’t gotten any better for years.

5 Likes

No, I mean, creating automatch filters is obviously possible for the devs, but the players wouldn’t have any control over them unless granted by the devs, if the calculation happens on the server side.

You started by saying

and there’s no reason I can see why devs should grant players this ability.


Here, you did conveniently take my quote out of context and removed the portion where I apply caveats and propose solutions (which, unless validly argued otherwise, I believe work) to prevent this farming.

Of course I do agree we need to be careful about this when thinking of new rating systems, and I’ll definitely admit this was a blind spot I had during the course of this conversation: I wasn’t really thinking about farming at all :sweat_smile: So thank you for bringing it up, and thank you for providing examples, which I’ll try to consider carefully.


Yeah, sure, but until we solve this philosophical problem, the 50% probability is our best measure for “interesting” games, or at least it’s the one that has been chosen in developing rating systems so far, apparently. If you can offer a better measure, I for one am listening :laughing:

1 Like

I honestly think the OGS system is already pretty good at both of these things, or it would be if the displayed rank were a (say it with me now) smoothed version of the volatile rating! :laughing: (Because if there’s a true upward trend, a smoothed version just takes a few games to catch up, and I’d say fast fluctuations just give newbies false hopes.)
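For what it’s worth, that smoothed display could be as simple as an exponential moving average over the volatile rating (the smoothing factor here is an arbitrary choice that would itself need tuning):

```python
def smoothed_ratings(volatile_ratings, alpha=0.15):
    # Exponentially weighted moving average: each new rating only
    # nudges the displayed value by a fraction alpha, so noise is
    # damped but a sustained trend still pulls the display along
    # after a handful of games
    smoothed = [volatile_ratings[0]]
    for r in volatile_ratings[1:]:
        smoothed.append(alpha * r + (1 - alpha) * smoothed[-1])
    return smoothed

# A one-game spike barely moves the display; a sustained rise does
print(smoothed_ratings([1500, 1600, 1500, 1500]))  # spike damped
print(smoothed_ratings([1500, 1600, 1600, 1600]))  # trend followed
```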

This is another problem. For the long-term future, I think the only solution would be to develop open-source rating software and, in a dream world, have it adopted by all major Go hubs. For the short term, as long as rating systems are population-dependent, I think the only thing we could and should do is run what are essentially awareness campaigns about this concept, and perhaps we might consider ditching the kyu ranking system altogether and just periodically publishing statistics estimating how the OGS rating relates to the rankings of other associations.

1 Like

True. An example of a rating system feature that was dropped, perhaps because it conflicted with people’s expectations:

IIRC the 15-game sliding window was dropped in the 2021 rating system update.
That sliding window confused many players when their rating went up after a loss or down after a win. The cause is of course that their latest game result would push an older game result out of the sliding window, and that older result might happen to have a greater and opposite effect on their rating than adding the latest result.
I think avoiding this confusion, and the questions about it, was an important reason to drop the 15-game sliding window. I don’t know if this (unintentionally) increased rating volatility, but intuitively I think it might have.
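A toy illustration of the mechanism (the real pre-2021 system re-ran a Glicko-style calculation over the window rather than averaging a performance number, but the drop-out effect is the same):

```python
def window_performance(games, window=15):
    # Toy "performance rating": opponent rating +400 for a win,
    # -400 for a loss, averaged over the most recent `window` games
    recent = games[-window:]
    return sum(opp + (400 if won else -400)
               for opp, won in recent) / len(recent)

# 15 old games whose oldest entry is a loss to a much weaker player
history = [(1200, False)] + [(1500, True)] * 14
before = window_performance(history)

# A new LOSS to a much stronger player pushes that old loss out of
# the window, and the windowed rating goes UP despite the loss
after = window_performance(history + [(1900, False)])
print(before, after)  # ~1826.7 -> ~1873.3
```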

So if a new rating system can have a similar effect of rating decrease after a win or rating increase after a loss, be prepared to answer many questions about it.

2 Likes