Is rating volatility a bug or a feature? [Forked]

I don’t particularly like “testing on a smoothed-out rating” unless the smoothing specifically affects current games as an edit to the rating system.

Remember that ratings are a prediction system, so editing past ratings should be done with the explicit purpose of making future predictions better.

1 Like

I had a suspicion… Well, I have to say this feels like we basically are already using a

xD

hmmm, I don’t really understand this part. I guess intuitively the sliding-window system was, in a weird way, performing an artificial smoothing of the rating, which perhaps hid the volatility and made it less effective at capturing uncertainty about the “true rating”.

Yes, I agree with that, and thank you for reminding me.

But that’s not exactly what I meant with my idea; I was more asking whether the test I proposed could be considered statistically significant: if an artificially smoothed-out rating can predict game results better than the actual volatile rating even in past games, doesn’t that give strong evidence that the volatility wasn’t an accurate reflection of the player’s true skill moment by moment? In this case, that’s the hypothesis I wanted to test specifically.
I feel that if averaging out performs better even in past games, that’s an even stronger piece of evidence.

Okay, the issue with this is what I mentioned before, but I’ll reword it slightly: it needs to be able to make these predictions without knowing future data points. That’s what I mean by having it “affect current rating”.

well, the issue is that true rating movement is modeled as a random walk (or Brownian motion) over time, so what volatility does is add the deviation of that random walk to the rating deviation.
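To make that concrete, here’s a toy Python sketch (not OGS code; the per-period step is the Glicko-2-style pre-period update, and the numbers are made up):

```python
import math

def inflate_deviation(phi, sigma, periods=1):
    """Glicko-2-style pre-period step: each elapsed rating period adds
    the volatility's variance (sigma^2) to the rating deviation's
    variance, mirroring how a random walk's variance grows with time."""
    return math.sqrt(phi ** 2 + periods * sigma ** 2)

phi, sigma = 1.2, 0.06  # made-up values on the Glicko-2 internal scale
print(inflate_deviation(phi, sigma, periods=1))   # slightly above 1.2
print(inflate_deviation(phi, sigma, periods=10))  # grows the longer you're inactive
```

So a higher volatility means uncertainty about the “true rating” grows faster between games.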

The issue with the sliding window is that it applied the same volatility update to a 3-month period as to a 3-day period, so long as there were 15 games. That’s fine as long as players don’t switch up their rate of playing, which they do. If the pace stays constant, the volatility adjusts to roughly the right deviation for that pace; but as soon as it switches, say they start completing rating periods in a quarter of the time they used to, then suddenly the volatility is giving roughly twice the update to the rating deviation that it should (until it adjusts to the new pace).
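A toy calculation of that pace mismatch (the per-day drift variance is an arbitrary assumption, purely for illustration):

```python
import math

# Toy numbers, not OGS code: if true skill drifts like a random walk
# with variance DRIFT_VAR per day, the deviation a rating period should
# add scales with elapsed days, whereas a fixed 15-game window adds
# roughly the same amount whether those games took 3 days or 3 months.
DRIFT_VAR = 0.01  # assumed per-day variance of the skill random walk

def deviation_for_elapsed_days(days):
    return math.sqrt(days * DRIFT_VAR)

slow = deviation_for_elapsed_days(90)  # 15 games over ~3 months
fast = deviation_for_elapsed_days(3)   # 15 games over ~3 days
print(slow / fast)  # ~5.5x: a pace-blind window badly over- or under-updates
```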

I mean, it’s a quadratic function, I’m sure if you try once more, you can figure it out. If you want to, otherwise just plot it. :smile:


Just playing around with the reward function yourPointsIfXWins.

1 Like

and that’s not even getting into how the “smoothing” mostly creates noise

Maybe Brier score - Wikipedia does the trick.
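For reference, the Brier score is just the mean squared error between predicted win probabilities and actual 0/1 outcomes; a minimal Python version:

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted win probabilities and
    actual results (1 = win, 0 = loss). Lower is better; always
    predicting 0.5 scores exactly 0.25."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

outcomes = [1, 0, 1, 1, 0]
print(brier_score([0.8, 0.3, 0.7, 0.9, 0.2], outcomes))  # confident and right: low
print(brier_score([0.5] * 5, outcomes))                  # 0.25, the know-nothing baseline
```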

I took the liberty of splitting this discussion off from the original topic.

Edit: it looks like I missed a few posts between post #40 and #50 in the original topic that should be in here too. But if I do that now, those posts will be incorrectly ordered by Discourse.

4 Likes

If the idea is that handicaps may be wrong because the players’ ratings fluctuate too much (randomly suggesting too high or too low handicaps), then this fluctuation itself could be looked at.

For example, compare rank graphs on this server to other servers (only players with hundreds of recent games, of course). If lines tend to be smoother elsewhere (for similar levels), then ratings here are likely inaccurate (random noise) as hypothesised. (Another test is how much 10-game streaks move already-solid ranks.)
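One cheap way to quantify that cross-server comparison (a sketch; the metric and the numbers are purely illustrative):

```python
def mean_abs_step(ratings):
    """A simple smoothness metric for a rank graph: the average
    absolute rating change between consecutive games. Computed over
    comparable histories (similar level, hundreds of recent games),
    a consistently larger value on one server suggests its ratings
    are noisier there."""
    steps = [abs(b - a) for a, b in zip(ratings, ratings[1:])]
    return sum(steps) / len(steps)

print(mean_abs_step([1500, 1510, 1490, 1520]))  # (10 + 20 + 30) / 3 = 20.0
```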

A player may play better one week than another, but this is exactly what a rating system should NOT try to follow (it’s the same kind of noise as playing better in one game than the next, yet ratings average over hundreds of games for good reason).

I mean, it should definitely try to follow it for the week that the player is playing better, provided that week has enough games in it, as it will boost predictions up until that player’s apparent demonstration of skill begins to flag. Anything else is a bias-variance tradeoff that depends entirely on the dataset that predictions will be made on.

I’m not gonna focus too much on the “average over hundreds of games” clause because I’m sure you know why we don’t use averages in most modern rating systems.

Just wondering if OGS ratings are inflated or simply misunderstood by beginners.

In this game, my opponent asked me whether I was 6 kyu (this was before I even finished a game):

In another game, my opponent (OGS 13-14 kyu) started with saying “I don’t know how to play. I hope you can give me more advice!” That does not sound like a 13-14 kyu. I thought a 13-14 kyu would be able to kill my top group and would not spend 50 moves on an obviously pointless invasion.

@meowkorkor: I know the title of the topic says “thoughts on […] the OGS rating system”, but we’re kind of in the middle of a long, technical, and quite focused discussion here :sweat_smile:

(I’ve been working since yesterday on a very long reply to the previous discussion, that I will post, uh, at some point today :laughing: – edit: and it’s also the reason why this reply is directed to the wrong person, ooops)

You should probably start another forum topic for those questions :slight_smile:

Oh. In my idea, by “predictions based on an averaged-out rating” I was imagining performing the prediction based on an average of the rating strictly before the game in question (a limited window, one way or another), so the average shouldn’t be aware of any future data point, only of whatever data was used to calculate the average for that point. I imagine that solves your concern?

I know it’s weird because such an average is always “late”, since the average is really most representative of the “center” of the averaging window. But I had anticipated exactly this problem, and that’s why I was thinking of it that way.
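A minimal sketch of such a causal (trailing-window) average in Python, with hypothetical names:

```python
from collections import deque

class TrailingAverage:
    """Causal smoother: the smoothed rating used to predict game N
    averages only ratings from games strictly before N, so no future
    data point can leak into the prediction (it lags by design)."""
    def __init__(self, window):
        self.window = deque(maxlen=window)

    def predict(self):
        # Smoothed rating available before the next game,
        # or None if we haven't seen any games yet.
        return sum(self.window) / len(self.window) if self.window else None

    def observe(self, rating):
        # Record the post-game rating once the result is known.
        self.window.append(rating)

ta = TrailingAverage(window=3)
for r in [1500, 1520, 1490, 1530]:
    print(ta.predict())  # each prediction strictly precedes its observation
    ta.observe(r)
```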

Uh, I don’t :laughing: can you explain?


You know what, I realized something. Nevermind that the above test was intended just to test the hypothesis that the volatility was an unwanted fluctuation. I might be missing something, and the following might be either an exceptionally obvious idea or a very silly one, but here goes.

As long as we’re careful not to do that “knowing the virtual future” thing, and as long as we’re testing through winrate predictions, I don’t actually see any reason why we wouldn’t be able to genuinely test almost whatever rating system we want (to a somewhat limited degree, but really not too much), without having to actually implement it as the only rating system on the site. Though this mainly goes for games without handicap.

(Click to see details - EDIT: THIS IS THE VIRTUAL TIMELINE FRAMEWORK)

Here’s my reasoning: what are the effects that a rating system has on the future games on the site? I’d argue that the main effects are:

  1. it affects the matchmaking, though I’d argue all semi-decent rating systems affect the matchmaking in a very similar way, so the effects of this might be limited;
  2. it definitely affects the handicap predicted;
  3. it (slightly?) affects the performance of players, because of the psychological effects of seeing the opponent’s rank, of being on an upswing or a downswing of one’s own rank, and of having a preconception about what one’s own rank “should” be.

At least for the moment, I don’t think we should care too much about point 3, even though I’m not sure it’s irrelevant. I’m not sure point 1 is very relevant when it comes to testing the winrate predictions, but I should probably describe the methodology I’m thinking of before we can discuss that.

Here’s the methodology:

  • Collect as many past games as you can process. For example, start from a random player, collect all of their ranked games, start going through their opponents and collect all of their ranked games, iteratively. At some point you stop and you remove from the sample all of the games where one of the players is a player that you hadn’t selected.

Hopefully now you have enough games to accurately sample the skill progression of most of the players you selected.

  • Sort the games in chronological order.

  • Now essentially simulate a virtual timeline and just run the rating system of your choice on your players (Maybe start each of them with a provisional rating equal to the OGS rating they had in the game you start with? We could try both ways out of curiosity). Obviously, at any point in the virtual timeline, only ever feed to the rating system data points from the past in the virtual timeline.

  • And, uh… see what happens.

I’d argue that with the exception of points 1, 2 and 3 above, this is pretty much exactly the same as if you had actually used this rating system in the real world. In fact, as a sanity check and out of curiosity, you could also run the very rating system that was actually in place and see if running it on a subset makes a difference. This could also be a good test of its solidity.
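To sketch what that virtual-timeline replay might look like (all names here are hypothetical; any rating system fitting the small interface could be plugged in):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Game:
    timestamp: float
    black: str
    white: str
    black_won: bool

class RatingSystem(Protocol):
    def predict(self, black: str, white: str) -> float: ...
    def update(self, game: Game) -> None: ...

def replay(games: list[Game], system: RatingSystem) -> list[tuple[float, bool]]:
    """Feed games in chronological order, always predicting *before*
    updating, so the system never sees future data points."""
    results = []
    for game in sorted(games, key=lambda g: g.timestamp):
        p = system.predict(game.black, game.white)  # prediction uses only the past
        results.append((p, game.black_won))
        system.update(game)  # only now does the result enter the system
    return results
```

The list of (prediction, outcome) pairs can then be scored with whatever metric we settle on (e.g. the Brier score mentioned above).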

So does point 1, the matchmaking, matter? I don’t know. On the one hand, matchmaking is already imprecise in the real system by necessity, so it’s imprecise in both the real timeline and the virtual one. If the general trend of the rating is the same, I think the difference shouldn’t be too large. Also, many games on OGS are not even automatched, which makes matchmaking slightly less of a factor.

But I would expect the average rating difference between matched players to be larger in the virtual system than in the real one, and in that case the virtual rating system would be expected to make stronger predictions than the real one. I genuinely don’t know whether this makes for a better or a worse test of the virtual rating’s solidity though :sweat_smile:

Alright, so point 2, the handicap. In a way, I think this is only slightly worse than the matchmaking, as I imagine the main effect being that the simulated system would less often see handicap games where it thinks the handicap is “right”. But in handicap games, the performances of the players are often affected by the number of handicap stones, so this is definitely more of a problem.

So I’d say a virtual simulation should be expected to be less reliable when it comes to handicap games, although again I can’t wrap my head around whether this means it’s a better test or a worse test. Science is hard.

Alright, so, am I missing something? xD


Why would you assume ratings on other servers are more accurate? As far as we know, it could be the exact opposite, i.e. the ratings on other servers being unable to catch up with high-frequency fluctuations in players’ skills, even if it was the case that the ratings on other servers are smoother (I’m guessing they probably are, but I don’t know that for sure). That’s pretty much exactly what we were trying to test (or develop a test for) here.

This is a better idea, although it might already be addressed in Glicko-2’s structure? Still, even in that case it might be a good idea to verify with hard evidence how solid the rating system really is in this sense.

That’s just, like, your opinion, man. Again, such an idea is exactly what we were trying to test: we might have the perception that a volatile rank is inherently a bad thing, but if it so happens that our volatile rank makes for better matchmaking, better win% predictions and a better handicap system, then it just means our perception is wrong.

One thing I do agree with you on is that there’s a culture around the concept of a rank, so when people see another player with rank X, they start forming ideas in their head. So if the rank on OGS is volatile, it creates a perception that it’s “bad” and “unreliable” just because it doesn’t match this cultural presumption that a player’s skill should be stable over time. Also, it’s weird that an “on average 2 kyu” on a bad day might appear with the same rank as an “on average 10 kyu” on a good day. Also, some people get really psychologically attached to their rank, and I think it’s needlessly vexing to have them see their rank constantly go up and down and feel random, which might easily make them feel that their efforts to improve aren’t bearing fruit because the improvement is hidden under all that noise.

For these reasons and maybe others, I agree that it would be nice to have a displayed rank that is very stable, and that’s why I keep proposing an “averaged out” version for that function specifically.


Well that was easy :laughing:

It’s all the more hilarious that I said “analytically”. I guess what I was really doing was trying to prove it algebraically, as it never occurred to me to just study the function.

Ah, the good ol’ mean squared error. Now this is something simple enough that even I can understand, so I can get on board :laughing:

Although I’m thinking, since the only reasonable way to use it, it seems to me, would be to use 1 and 0 for the observed result, could it be a problem that most of the predicted values are very close to 0.5 (because most games, especially the ones we care about the most, have people playing on a supposedly even field)?

It feels like, for any given game considered, most of the difference between a good rating system and a bad one is squashed into pretty much the flattest part of the parabola (around 0.5^2 = 0.25), so I’m worried that it might be too susceptible to random noise.
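That worry can actually be quantified. For a prediction q against a true win probability p, the expected Brier score is p(1-q)^2 + (1-p)q^2, and a little algebra shows the gap between the naive 0.5 predictor and a perfectly calibrated one is exactly (0.5 - p)^2, which is tiny for near-even games:

```python
def expected_brier(q, p):
    """Expected Brier score when predicting probability q for an event
    whose true probability is p: p*(1-q)**2 + (1-p)*q**2."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

# Near-even game: knowing the true probability barely helps the score.
p = 0.55
print(expected_brier(0.5, p) - expected_brier(p, p))  # (0.5 - p)**2 = 0.0025

# Lopsided game: the gap is much larger.
p = 0.9
print(expected_brier(0.5, p) - expected_brier(p, p))  # (0.5 - p)**2 = 0.16
```

So most of the Brier signal really does come from the less even games, and near-0.5 games contribute little to separating a good system from a mediocre one.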


Well, I was originally going to brainstorm and propose a few ideas for my preferred rating systems to test out, but I think this message is already long enough :sweat_smile:

Thank you for reading! :laughing:


now, since the topic has been forked, many quotes have been broken, so I’m gonna spend some time trying to fix what I can in my own posts…

1 Like

When I set up a challenge and check my ratings, there is often a difference between the profile rating and the custom games rating.



(profile)


(custom games)


If I jump back from my profile rating to the custom games page, the rating is adjusted.

Sometimes the difference in rating between profile and custom games rating is as much as 0.9 kyu. This is rather confusing.
The custom games rating seems to be lagging behind.

Can someone explain this phenomenon to me?


1 Like

Realistically, this would almost certainly lead to chasing random patterns only. Even with many games, we don’t expect a player’s real abilities to change on a weekly basis (fast-learning beginners aside). And things like “form” or “focus”, even if they’re not just illusions created by variance, don’t necessarily carry consistently from one game, day, or week to the next. Solid rating systems and better predictions are based on long-term expectations: things that tend not to change randomly and are predictable.

Because it’s not those servers where the handicap and fluctuation-related complaints are being investigated; and also, if two other servers are similarly smooth and OGS is noisy, it’s less likely that the others are the ones that are wrong.

:slight_smile:

2 Likes

Probably anoek, if anyone?


@jannn:
Listen, you can keep repeating your theories as many times as you want. We pretty much discussed what you’re talking about, and more thoroughly imho, at the very start of this topic; hopefully you’re not coming in without having read that. What we were discussing was exactly a way to test whether your theory is right.

If you understand scientific methodology, you should be completely on board. If you’re not on board and you don’t bring valid criticisms to the table (and so far you haven’t, as far as I’m concerned), it means you probably don’t reason scientifically and you gon get 'gnored. Maybe other people will humour you more than me. I hope I haven’t been rude, have a good day.


Also, wow, I guess a lot of people were anxious to jump in with their doubts about the rating system, I’m smelling another fork soon :sweat_smile:

I think matchmaking is just for the humans, because they want good chances for both players. A rating system should be able to handle all possible sets of matches; the games don’t have to be equally winnable by both players.

1 Like

I think this comes down to the question of whether high-frequency fluctuations in players’ move quality are better described by a random process or by a process with some time correlation.

One could even increase the time resolution of the rating calculations to individual moves. Some moves of a player are blue moves (virtually no point loss, 13d level), while other moves are silly blunders (50k level). But even if it were possible to predict those high-resolution wild “skill” fluctuations and account for them in players’ ratings, I don’t think it would be useful. What would be the purpose? Better matchmaking or better handicap calculation?

1 Like

Well, yes. Or “both”, if you were asking which one :laughing:

TL;DR: But that's just, like, my opinion, man.

I mean, people can of course disagree on what they wish the rating system did, but, especially on an online playing platform, it seems to me personally that better matchmaking and better handicap calculation are… basically the whole point of having a rating system in the first place. And if there’s a problem with the volatility, it’s first and foremost if it comes at the expense of matchmaking and handicap. (Again just my opinion)

You might argue that it’s more instructive, and perhaps even fun by being more wildly unpredictable, if the rating is more “lax” and players are just matched with players “loosely around” their level.

I don’t know. I’d say this is a matter to be investigated with surveys and things like that :slight_smile: maybe they were even already performed on OGS :laughing:

Also, as I said before, I would agree with you that better matchmaking and better handicap needn’t be the purpose of a good ranking system, especially if it comes at the cost of rank stability.

The displayed rank has a lot of implications for how players perceive themselves and other players, and it’s intertwined with cultural expectations about what a player should be able to do at a certain level (see e.g. the fork-worthy post by meowkorkor above), so I personally feel that the purpose of the displayed rank is instead to match that cultural expectation as well as possible.


But if people mostly agree that better matchmaking and better handicap calculation is in fact what they want, we’d like to have a way to test how good a rating system is at that.

And even if people who agree with me are just a minority, I don’t think we’re hurting anyone by investigating these matters in our own corner, as nerds do.

You can say “what would be the purpose” about pretty much everything in life, and about life itself of course. I think the only good answer is usually “to have fun” :thinking:

3 Likes

Today’s 13ks are yesterday’s 20ks. But ranks are quite inflated everywhere.

2 Likes

I changed the title of this thread according to a suggestion by @espoojaram. I agree that it reflects the forked topic better.

1 Like