So, there was a previous thread mostly about this, but it started out forked from another one and got very long and technical, so I expect not many people have read it or had a good idea what it was about.
I thought I would sum up the salient points here and keep going from there, so that people who don’t have much time can know where we’re at and have the opportunity to maybe jump in with topical suggestions.
And since I expect this thread to get quite long-winded and technical too, I plan to keep doing this periodically or something like that. At some point I might set up a meta-thread to link all of them together and sum them up, but for now I believe this is enough. Or maybe it’s simpler to periodically edit a summary of the recent developments into this first post, maybe make it a wiki.
First, a quick explanation of what we mean by "volatility" (click to show)
For those not in the know, when talking about volatility we’re talking about the “wild” swings in the rating that affect some players:
[example rating graph: 10 kyu, 14 kyu, 15 kyu?, 11 kyu; what level is this player?]

This, combined with the cultural perception that a player’s “true strength” is pretty stable over time, and with the fact that correspondence games affect the rating with a “delay”, leads many to speculate that these swings just aren’t accurate reflections of the player’s strength at any point in time.
So then, the discussion arose from @snakesss wondering how the handicap system can keep a reasonable winrate with the likely unreasonable volatility we have, with @gennan and me then discussing it. To this, there are at least two possible explanations:
1. In the cumulative winrate, the “excess” wins and “excess” losses that shouldn’t happen cancel each other out, leading to a misleadingly “good” overall winrate.
2. The volatility in the rating system is actually succeeding in keeping track of when players are playing better or worse on a game-by-game basis, and thus the rating a player has when playing a specific game is usually correct.
I believe we all suspect (1) to be a much more likely explanation, but that’s not how scientific knowledge works. We need to perform an actual test to know for sure, by Jove!
But how do we actually test it?
@Jon_Ko pointed this out: since the rating system we use is based on Glicko-2 and basically works by predicting a user’s probability of winning against another user (just based on their rating info), and then slightly adjusting the ratings based on how much the result differs from the prediction, we can “ask” it what its prediction is, and check how accurate those predictions are game by game.
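To give a rough idea of the kind of prediction involved, here’s a sketch of the standard Glicko-style expected-score formula. Combining both players’ deviations is one common choice for a symmetric pre-game prediction; I’m not claiming this is exactly the code OGS runs:

```python
import math

def g(rd):
    """Glicko g-function: shrinks the effect of the rating difference when the
    rating deviation (uncertainty) is large."""
    q = math.log(10) / 400
    return 1 / math.sqrt(1 + 3 * (q ** 2) * (rd ** 2) / math.pi ** 2)

def predicted_win_prob(r_a, rd_a, r_b, rd_b):
    """Predicted probability that player A beats player B, from ratings and RDs."""
    rd_combined = math.sqrt(rd_a ** 2 + rd_b ** 2)
    return 1 / (1 + 10 ** (-g(rd_combined) * (r_a - r_b) / 400))

# Example: a 1600 +/- 60 player against a 1500 +/- 150 player
print(predicted_win_prob(1600, 60, 1500, 150))  # roughly 0.63
```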
He, @Allerleirauh, @paisley and I made proposals on how to quantify that accuracy. In the end, it turned out almost all of the proposals were equivalent: use “binomial deviance” to measure the accuracy of the predictions.
@meili_yinhua provided a very good explanation of binomial deviance for dummies like me:
(Click to show explanation)
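To make it concrete, here’s a minimal sketch of computing the mean binomial deviance over a list of games (I’m using the natural log here; some definitions use log base 2 or 10, which only changes the scale):

```python
import math

def mean_binomial_deviance(predictions, outcomes):
    """Mean binomial deviance (a.k.a. log loss) of a set of predictions.

    predictions: predicted probabilities that the first player wins
    outcomes:    1 if the first player actually won, 0 otherwise
    Lower is better; always predicting 50% scores ln(2), about 0.693."""
    eps = 1e-12  # guard against log(0) for over-confident predictions
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(predictions)

# Example with made-up numbers: three good predictions and one upset
print(mean_binomial_deviance([0.7, 0.55, 0.9, 0.3], [1, 1, 1, 1]))
```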
Other proposals were: one by me, to group up all the games where the rating system predicted a similar winrate (say, around 55% for Black) and, for each of these groups, count the actual percentage of wins and compare it to the prediction; and another by @Jon_Ko, to use the mean squared error.
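In case it helps, here’s what both of those could look like in code. The 5% bucket width is an arbitrary choice just for illustration:

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, bucket_width=0.05):
    """Group games by predicted winrate and compare each group's actual winrate.

    Returns (bucket_midpoint, mean_prediction, actual_winrate, n_games) per bucket."""
    buckets = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        buckets[int(p / bucket_width)].append((p, y))
    table = []
    for key in sorted(buckets):
        games = buckets[key]
        mean_pred = sum(p for p, _ in games) / len(games)
        actual = sum(y for _, y in games) / len(games)
        table.append(((key + 0.5) * bucket_width, mean_pred, actual, len(games)))
    return table

def mean_squared_error(predictions, outcomes):
    """@Jon_Ko's proposal: mean squared error (the Brier score) of the predictions."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)
```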
Technically there was also a proposal, by more than one person, to compare the ratings of players who also have ratings on other sites or in other associations, but there would be no reason to assume one rating system is better or worse than the other(s), so that would just mean delaying the inevitable need for a way to quantify the accuracy of each rating system. Still, if we could gather the data needed to do this, it would certainly be cool to perform the comparison.
Well, other than a few digressions and discussions about how the Glicko-2 rating system works, that’s the gist of it for now!
Now, here’s a list of aspects that I think we should discuss one by one, but I’ll write all of them together so that I don’t forget (I’ll edit more in if I think of more or if I get suggestions):
- We need to make sure we’re actually testing our current rating system and not a “ghost” of it.
- Even knowing what mathematical function to use to evaluate it, we need a “control” to measure it against, otherwise I believe the measure is mostly meaningless (see the baseline sketch after this list).
- In fact, do we actually need to implement something like my virtual timeline framework in order to do that?
- I speculate that correspondence games probably make a rating considerably less accurate, so I think we should test them separately or something.
- Since we’re specifically testing whether the volatility is bad, would it make sense to pay special attention to the games that happened when the rating was “far from the average” (of the current player’s strength)?
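On the “control” point, the cheapest possible baseline is a predictor that ignores ratings entirely and always says 50%. A purely illustrative sketch of the comparison we’d be making, with made-up numbers instead of real OGS data:

```python
import math

def mean_binomial_deviance(preds, outcomes):
    # same definition as in the earlier sketch
    eps = 1e-12
    clip = lambda p: min(max(p, eps), 1 - eps)
    return -sum(y * math.log(clip(p)) + (1 - y) * math.log(1 - clip(p))
                for p, y in zip(preds, outcomes)) / len(outcomes)

# Made-up numbers, just to show the shape of the comparison.
glicko_preds = [0.72, 0.55, 0.48, 0.90, 0.33]   # what the rating system predicted
outcomes     = [1, 1, 0, 1, 0]                  # what actually happened
naive_preds  = [0.5] * len(outcomes)            # the "know-nothing" control

print("rating system:", mean_binomial_deviance(glicko_preds, outcomes))
print("always 50%:   ", mean_binomial_deviance(naive_preds, outcomes))
# If the first number isn't clearly lower than the second, the rating system's
# predictions aren't adding much information over blind guessing.
```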
I think the first point is the most urgent to discuss, so I’ll write the first reply focusing on that (which is why I deliberately left it vague).