Is rating volatility a bug or a feature? [Forked]

Oh. In my idea, by “predictions based on an averaged out rating” I was imagining performing the prediction based on an average of the rating strictly before the game in question (a limited window, one way or another), so the average is never aware of any future data point, only of the data that came before that game. I imagine that solves your concern?

I know it’s weird because such an average is always “late”: it’s really most representative of the “center” of the averaging window. But I had anticipated exactly this problem, which is why I was thinking of it that way.
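To make it concrete, here’s a minimal sketch of what I mean (Python; the window size and the `past_ratings` list are just placeholders I made up):

```python
def trailing_average(past_ratings, window=15):
    """Average of the last `window` ratings strictly before the game in question.

    `past_ratings` must contain only ratings from games that finished
    *before* the game being predicted, so no future data can leak in.
    """
    recent = past_ratings[-window:]  # the limited window, trailing only
    return sum(recent) / len(recent) if recent else None
```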

Uh, I don’t :laughing: can you explain?


You know what, I realized something. Never mind that the above test was intended just to test the hypothesis that the volatility was an unwanted fluctuation. I might be missing something, and the following might be either an exceptionally obvious idea or a very silly one, but here goes.

As long as we’re careful not to do that “knowing the virtual future” thing, and as long as we’re testing through winrate predictions, I don’t actually see any reason why we wouldn’t be able to genuinely test almost any rating system we want (to a somewhat limited degree, but really not that limited), without having to actually implement it as the only rating system on the site. Though this mainly goes for games without handicap.

(Click to see details - EDIT: THIS IS THE VIRTUAL TIMELINE FRAMEWORK)

Here’s my reasoning: what are the effects that a rating system has on the future games on the site? I’d argue that the main effects are:

  1. it affects the matchmaking, though I’d argue all semi-decent rating systems affect the matchmaking in a very similar way, so the effects of this might be limited;
  2. it definitely affects the handicap predicted;
  3. it (slightly?) affects players’ performance through psychological effects: seeing the opponent’s rank, being on an upswing or downswing of one’s own rank, and having a preconception about what one’s own rank “should” be.

At least for the moment, I don’t think we should care too much about point 3, even though I’m not sure it’s irrelevant. I’m not sure point 1 is very relevant when it comes to testing the winrate predictions, but I should probably describe the methodology I’m thinking of before we can discuss that.

Here’s the methodology (a rough code sketch follows the list):

  • Collect as many past games as you can process. For example, start from a random player, collect all of their ranked games, then go through their opponents and collect all of their ranked games, iteratively. At some point you stop, and you remove from the sample every game where one of the players isn’t among those you selected.

Hopefully now you have enough games to accurately sample the skill progression of most of the players you selected.

  • Sort the games in chronological order.

  • Now essentially simulate a virtual timeline and just run the rating system of your choice on your players (maybe start each of them with a provisional rating equal to the OGS rating they had in the game you start with? We could try both ways out of curiosity). Obviously, at any point in the virtual timeline, only ever feed the rating system data points from the past of that timeline.

  • And, uh… see what happens.
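In rough Python, the loop I’m imagining looks like this (the `RatingSystem` interface and the game fields are hypothetical, just to show the shape of it; the key point is that each prediction is made *before* the result is fed in):

```python
# Sketch of the virtual timeline (all names hypothetical).
# `games` is the collected sample, already sorted chronologically.

def run_virtual_timeline(games, rating_system):
    predictions = []
    for game in games:
        # Predict first: at this moment the system only knows the virtual past.
        p_black = rating_system.predict(game.black_id, game.white_id)
        outcome = 1.0 if game.winner_id == game.black_id else 0.0
        predictions.append((p_black, outcome))
        # Only now feed the result in, advancing the virtual timeline one game.
        rating_system.update(game.black_id, game.white_id, outcome)
    return predictions  # (predicted winrate, actual result) pairs to evaluate
```

“See what happens” would then mean comparing those (prediction, result) pairs across the candidate systems.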

I’d argue that with the exception of points 1, 2 and 3 above, this is pretty much exactly the same as if you had actually used this rating system in the real world. In fact, as a sanity check and out of curiosity, you could also run the very rating system that was actually in place and see if running it on a subset makes a difference. This could also be a good test of its solidity.

So does point 1, the matchmaking, matter? I don’t know. On the one hand, matchmaking is already imprecise by necessity, so it’s imprecise in both the real system and the virtual one. If the general trend of the ratings is the same, I think the difference shouldn’t be too big. Also, many games on OGS aren’t even automatched, which makes matchmaking slightly less of a factor.

But I would expect the average rating difference between players to be larger in the virtual system than in the real one. And in that case, the virtual rating system would tend to make more confident predictions (further from 50%) than a real one. I genuinely don’t know whether this makes for a better or a worse test of the virtual rating’s solidity though :sweat_smile:

Alright, so point 2, the handicap. In a way, I think this is only slightly worse than the matchmaking, as I imagine the main effect being that the simulated system would less often see handicap games where it thinks the handicap is “right”. But in handicap games, the performances of the players are often affected by the number of handicap stones, so this is definitely more of a problem.

So I’d say a virtual simulation should be expected to be less reliable when it comes to handicap games, although again I can’t wrap my head around whether this means it’s a better test or a worse test. Science is hard.

Alright, so, am I missing something? xD


Why would you assume ratings on other servers are more accurate? As far as we know, it could be the exact opposite: even if other servers’ ratings are smoother (I’m guessing they probably are, but I don’t know that for sure), they might be unable to catch up with high-frequency fluctuations in players’ skills. That’s pretty much exactly what we were trying to test (or develop a test for) here.

This is a better idea, although it might already be addressed in Glicko-2’s structure? Still, even in that case it might be a good idea to verify with hard evidence how solid the rating system really is in this sense.

That’s just, like, your opinion, man. Again, such an idea is exactly what we were trying to test: we might have the perception that a volatile rank is inherently a bad thing, but if it so happens that our volatile rank makes for better matchmaking, better win% predictions and a better handicap system, then it just means our perception is wrong.

One thing I do agree with you on is that there’s a culture around the concept of a rank, so when people see another player with rank X, they start forming ideas in their head. If the rank on OGS is volatile, it creates a perception that it’s “bad” and “unreliable” just because it doesn’t match the cultural presumption that a player’s skill should be stable over time. It’s also weird that an “on average 2 kyu” player on a bad day might appear with the same rank as an “on average 10 kyu” player on a good day. And some people get really psychologically attached to their rank, so I think it’s needlessly vexing for them to see it constantly go up and down and feel random; that might easily make them feel their efforts to improve aren’t bearing fruit because the improvement is hidden under all that noise.

For these reasons and maybe others, I agree that it would be nice to have a displayed rank that is very stable, and that’s why I keep proposing an “averaged out” version for that function specifically.


Well that was easy :laughing:

It’s all the more hilarious that I said “analytically”. I guess what I was really doing was trying to prove it algebraically, as it never occurred to me to just study the function.

Ah, the good ol’ mean squared error. Now this is something simple enough that even I can understand, so I can get on board :laughing:

Although I’m thinking: since the only reasonable way to use it, it seems to me, is to encode the observed result as 1 or 0, could it be a problem that most of the predicted values are very close to 0.5 (because most games, especially the ones we care about the most, are played on a supposedly even field)?

It feels like, for any given game considered, most of the difference between a good rating system and a bad one is squashed into pretty much the flattest part of the parabola (around 0.5^2 = 0.25), so I’m worried that the measure might be too susceptible to random noise.
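To put numbers on that worry, here’s the mean squared error spelled out on a few made-up single-game cases (Python; the probabilities are arbitrary):

```python
def mse(predictions):
    """Mean squared error between predicted win probability and the 1/0 result."""
    return sum((p - result) ** 2 for p, result in predictions) / len(predictions)

# Made-up single games: near 0.5 the contribution is ~0.25 whichever way the
# game goes, so one even game says almost nothing about prediction quality...
print(mse([(0.50, 1.0)]))  # 0.25
print(mse([(0.52, 1.0)]))  # 0.2304 -- slightly "better" guess, tiny gap
print(mse([(0.52, 0.0)]))  # 0.2704 -- same guess, other result, tiny gap
# ...whereas a confident correct call stands out clearly:
print(mse([(0.90, 1.0)]))  # 0.01
```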


Well, I was originally going to brainstorm and propose a few ideas for my preferred rating systems to test out, but I think this message is already long enough :sweat_smile:

Thank you for reading! :laughing:


Now, since the topic has been forked, many quotes have been broken, so I’m gonna spend some time trying to fix what I can in my own posts…
