I think we need to discuss the consequences of this for testing.
So far, I’ve been assuming the obvious way to perform such a test is something like my “virtual timeline” idea: performing the same calculations that would have happened if the rating system under test had been in use in real life while the same sequence of matches was played.
(Here's the original full explanation of the idea, just for reference, but you don't need to read it)
So, because of the way matches are chosen, most of them are between players with similar rating (using what I’m going to call the “real rating system”), which means most of the predictions in the real rating system are close to 50%; whereas in the “virtual rating system” that’s not necessarily the case.
In my original proposal I anticipated this concept and made these considerations:
So, we’ve now discussed the idea that most evaluation metrics, if not all, seem to punish wrong predictions that are close to 50% less harshly than wrong predictions that are far from 50%.
I’m trying to think about this, but I’m unable to reach a conclusion.
Do you think this gives an edge to the real rating system, because its wrong predictions, closer to 50%, are punished less? Or do you think this gives an edge to the virtual rating system, because “it’s easier to correctly predict the result if the players have very different ratings”?
Or does it not matter and whichever system gets the “edge” is just the one making better predictions?
There’s a little intuitive voice in my head saying that the answer actually depends on whether you’re using a metric that evaluates games individually, like binomial deviance, or a metric that evaluates based on percentage of correct predictions, like the WHR or the one I had come up with:
But I have no idea how to reason thoroughly about this.
However, as you mention, this is on a per-game basis and averaged across all games, so “incorrect” 50/50 predictions aren’t punished as much as they could be. A system that guesses 50/50 every time will score 0.693 by this metric – very similar to the results in my small run shown above, which makes sense given the rank stratification (maybe contradicting the “lower is better” note in this case, at least to some degree).
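To make the 0.693 figure concrete, here’s a minimal sketch of mean binomial deviance (per-game log loss). The function name and data layout are just my own illustration, not the actual evaluation code; the point is that a constant 50% predictor scores exactly ln(2) ≈ 0.693 regardless of the outcomes:

```python
import math

def binomial_deviance(predictions, outcomes):
    """Mean per-game log loss over a list of games: lower is better.
    predictions[i] is the predicted P(win) for game i; outcomes[i] is
    True if that player actually won."""
    assert len(predictions) == len(outcomes)
    total = 0.0
    for p, won in zip(predictions, outcomes):
        total += -math.log(p) if won else -math.log(1.0 - p)
    return total / len(predictions)

# A system that always guesses 50% scores ln(2) ~= 0.693 no matter
# what the outcomes were, since -log(0.5) is charged every game.
outcomes = [True, False, True, True, False]
print(binomial_deviance([0.5] * len(outcomes), outcomes))
```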
The fact that the database likely contains a lot of games between players of similar skill level also probably contributes to this observation. It might be interesting to see what happens when the metrics aren’t grouped by rank bands, or are grouped differently to test rank separation. For example, one could evaluate a similar metric on games in which the ranking system thinks the players are in the same integral rank (expecting roughly 50% predictive accuracy), and then again on subsets of the dataset involving players that the system thinks are of significantly different ranks. Your suggestion about grouping games by predicted win rate is similar and could also be worthwhile.
Then again, it’s unclear whether any of these ideas are any better of a metric for evaluating a rating system, and as @flovo mentions, they might be problematic when there just isn’t enough data for certain players over a given time period. Again, probably worth trying all of these out just to see.
On that note, we might make similar tables but where each row is stratified by rank difference (e.g. rows could have rank differences of [0, 1), [1, 2), [2, 3), etc). In a good system, you’d expect predictive accuracy to increase up from 50% across these rows. I understand that the original tables were more concerned with maintaining roughly equal win rates across different handicaps, so this might be a useful addition for comparing the old system to modified ones.
Edit: if we keep the handicap columns, you’d expect the ~50% to continue along the diagonal in such a table.
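As a rough sketch of the rank-difference stratification described above (the `(rank_a, rank_b, a_won)` tuple schema is hypothetical, not the actual dataset format), the rows of such a table could be computed like this:

```python
from collections import defaultdict

def accuracy_by_rank_gap(games):
    """games: iterable of (rank_a, rank_b, a_won) tuples.
    Buckets games into rank-difference bands [0,1), [1,2), ... and
    reports how often the higher-ranked player won in each band.
    (Exact ties are counted arbitrarily; we'd expect ~50% there anyway.)"""
    buckets = defaultdict(lambda: [0, 0])  # band -> [stronger-player wins, total]
    for rank_a, rank_b, a_won in games:
        band = int(abs(rank_a - rank_b))
        stronger_won = a_won if rank_a > rank_b else (not a_won)
        buckets[band][0] += stronger_won
        buckets[band][1] += 1
    return {band: wins / total for band, (wins, total) in sorted(buckets.items())}
```

In a good system you’d expect the returned values to climb from ~0.5 as the band index grows.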
Funnily enough, I was so skeptical of the current level of volatility that I had actually speculated this might happen – that a system just predicting 50% all the time wouldn’t necessarily “perform” worse than the current one, at least in games between players who are reasonably close in rating
On a related note, I have to admit that when you said
I was surprised, because I had actually been thinking kind of the opposite – intuitively it’s pretty easy for any decent rating system to predict that a much stronger player is going to win, and volatility is somewhat “bounded” because, for example, if a “truly 6 kyu” player is misranked by the system as 3 kyu, they will start losing most games and the rating will naturally be brought closer to the “true rating”.
So even a volatile system can avoid “egregious” mistakes, and if you ask it “who do you expect to win?” it will almost always answer correctly for games between mismatched players.
AFAIU this is exactly what the WHR evaluation metric did, so I’m thinking it might actually be a very good metric to evaluate games between closely matched players (perhaps a “lumping” metric like the “naive” one I proposed might be even better?)
But now I’m starting to realize that what you meant was probably that the devil is in the exact winrate predicted.
If a rating system can more accurately predict the exact winrate between mismatched players, it should mean it’s matching the probabilistic model underlying the rating more accurately, and thus that the mismatched players were more accurately rated.
Anyway, I agree that we should just use all the decent metrics we can think of and hope that they all come to the same conclusion; I was thinking that from the beginning.
Oh, and to come back to the question at hand (whether a metric “gives an edge” to the real rating system or to the virtual one): just to be clear, ideally we would of course verify that the metric doesn’t give an edge to either one;
but if that isn’t the case, the second-best scenario would be a metric that gives an edge to the real system, for scientific reasons: in our test we’re trying to “prove” that there is a better system, so if the virtual system outperforms the real one even under a metric that favors the real one, that’s even stronger evidence.
It’s important to note that this is occurring in the tables I posted above precisely because the ranking system is doing a good job at ranking people of similar skill closely, so the caveat “at least in games between players who are reasonably close in rating” is key. A system that predicts 50% all the time for everyone would be significantly worse, and obviously wouldn’t assign rankings correctly at all. Sure it would perform the same within the bands of the table above as you mentioned, but the only reason we have that table stratified at all is because of the current ranking system.
You did note this, but I want to restate it in these terms to point out why I’m suggesting we also look at predictions between ranks – because it will give a better picture of how modifying the volatility affects the system’s ability to assign predictive ranks as a whole.
This idea is probably summed up best in response to this statement:
I’m saying that it’s not necessarily even a volatile system that can avoid these mistakes, it might be especially a volatile system that can avoid these mistakes. In order to see how a less volatile system performs, we need to evaluate it on both games it thinks are evenly matched and games it thinks aren’t. Any given system will have a different idea of which games are mismatched – “mismatched” is in the eye of the system assigning rankings itself. It could happen that a less volatile system still does a pretty good job identifying people of similar skill, but is slightly less predictive when considering people who differ greatly in the ranks it assigns (which is relevant for handicaps). (It could also be more predictive, or not significantly different)
Right, and that’s part of the advantage of having some degree of volatility. I’m just saying that it’s unclear how much reducing volatility will affect the system’s ability to perform that kind of correction as necessary. It all comes down to what types of corrections are necessary and how often they occur. I still fully acknowledge that a less volatile system might be better, just want to add yet another way of comparing.
Exactly, as this is directly related to the ranks the system assigns.
So, do you think separating matches into buckets depending on what we’ve been discussing might be useful? Specifically:
Would it be useful to subdivide matches into “matches between closely matched players” and “matches between mismatched players” given one rating system being tested?
Would it be useful to instead subdivide games into “games about which the real and virtual ratings systems have significantly different win% predictions” and “games where they have similar predictions”?
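The second kind of bucketing could look something like this sketch (the predictor callables are placeholders for however the real and virtual systems would actually expose their win% predictions; the 10% threshold is an arbitrary choice for illustration):

```python
def split_by_disagreement(games, real_pred, virtual_pred, threshold=0.10):
    """Split games into those where the two systems' win% predictions
    roughly agree and those where they significantly differ.
    real_pred / virtual_pred: callables mapping a game to P(player A wins)."""
    similar, different = [], []
    for g in games:
        if abs(real_pred(g) - virtual_pred(g)) > threshold:
            different.append(g)
        else:
            similar.append(g)
    return similar, different
```

Evaluating each metric separately on the two buckets would show directly where the systems’ behavior diverges.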
I realized this is pretty much an upgraded reformulation of the idea I had proposed here:
Which reminds me, I have to update the first post in this thread with a summary of what we’ve said so far!
I know it’s a lot of work, but since I believe I only have a superficial understanding of the arguments that have been brought so far, could I ask you to kindly rewrite the summary you wrote here, but in the form of a bullet point list? That would help me a lot
(Well, it would also be ideal to express it as much as possible in terms that a layman could understand, but I definitely won’t blame you if you can’t!)
I would consider using a number of buckets, but not necessarily as a “metric of performance” but as a sort of “looking for correlations that could lead to a better system”
Like, we might find that certain players (maybe clustered around a rank) have a high subjective volatility because they have a wider spread of “performance abilities” centered around a “true rating”, leading to a higher overall variance in results. (My experience makes me extremely suspicious of the assumption that any two players with the same “true rating” have the same chance of winning against any given player of a different rating, even though that assumption makes the Elo/Glicko math simpler.)
Like, as I see it, at the end of the day there’s one metric (or potentially a “pseudometric” made from a combination of multiple metrics and human intuition) that we make decisions on. If certain buckets are “interest killers”, or even a major part of our “pseudometric”, then absolutely test them; the rest are “primary research data”.
Hm, you know what, there’s a subtlety here. I believe the phenomenon I described is an advantage that comes from a volatile matchmaking more so than a volatile rating.
Some amount of mismatching gives (or is expected to give) the system more info than if the players were always matched perfectly (in terms of “schmating” – ok, this is getting ridiculous, is there really not a pre-existing technical term for that? I don’t know, “percentile rating” might be good?)
I suspect a less-reactive rating would be able to get about as much “information” (in signal-to-noise terms) from those mismatches as a volatile one.
One drawback is that it would be slower to react in case the rating estimate was very wrong, but considering what our objective is and the kind of scenarios we’re trying to prevent (i.e. most users on OGS have a fairly stable center of undulation, and we basically want to “bring that to the surface”), this might not be a problem.
(Also, intuitively, I think multiple-game ratings periods would be able to somewhat circumvent this problem in a different way, as I talked about before (the analogy to waves). They might be able to both be unreactive to high-frequency noise and reasonably reactive to being significantly wrong.)
But this brings me back to the conundrum we were talking about: the purpose of the rating system is the matchmaking, so in practice a lot of matchmaking (both automatic and manual) is based on the rating system. That means that if the rating is less volatile, then the matchmaking usually will be too, which I expect will cause volatility for a different reason.
I’m feeling that basically there’s a very significant trade-off between the accuracy of the matchmaking and the accuracy of the rating. If you want the rating to be more accurate, you have to sacrifice the accuracy of the matchmaking.
The other direction feels more complicated, though: if you want to increase the accuracy of the matchmaking, you can’t have a fully accurate rating system; but if you don’t have an accurate rating system, the matchmaking will necessarily be inaccurate too.
So for maximizing matchmaking accuracy, there’s probably a sweet spot in the trade-off.
But most relevantly for our purposes, if I’m right about this, then I think it means that the virtual rating system probably does get an edge over the real one during the testing (because the matchmaking following the real rating system causes a feedback that “confuses” the real rating system a little bit, at least in the case of instant ratings), which is the worst-case scenario
I believe @joooom and I talked about this at length when this interaction began (we talked about latent spaces, intransitivity and whatnot):
So yes, if you could have more information about the playing style of the players, having the same “true rating” (in this case defined as “percentile rating” – defined here) wouldn’t imply they have the same probability of winning against a third player, but since you don’t have access to that information, all the probabilities cancel out, giving you the same expected winrate if the percentile rating is the same;
and if you had enough data, you should expect real-world data to match that, as long as the “percentile rating” was correctly assigned to both players.
EDIT: Oh, actually, I just realized this reasoning only holds for two players who have the same percentile rating. It’s possible that this kind of reasoning might be generalized for different percentile ratings too, but the argument as it is now doesn’t work, because the symmetry I described doesn’t happen. Hmmm.
I’m not even referring to intransitivity. I’m referring to “performance variance”
because let’s divide players into two groups: solid players and experimental players. Solid players play what they know and don’t try anything they aren’t very confident can’t be punished, whereas experimental players frequently try out ideas that they’re not quite sure of, but that aren’t clearly bad.
Now take an experimental player and a solid player that have a 50% winrate and the same “true rating” with no other clear style difference or notable piece of intransitivity:
It seems to follow that the experimental player has a higher (albeit still less than 50%) chance of winning against a stronger player than the solid one, as the more often their ideas succeed, the better their relative performance is, whereas the solid player has a higher chance of winning games against weaker players, as they’re less likely to blunder and give the opponent chances
The idea being that the experimental player has a more variant performance, but the glicko/elo formulations assume exactly one “performance variance” across all players (which defines the elo/glicko scales)
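This intuition can be checked with a toy model (my assumption for illustration, not the actual Elo/Glicko formulation): suppose each player’s per-game performance is drawn from a normal distribution centered on their “true rating”, and the higher draw wins. Then two players with the same mean but different performance variance really do fare differently against mismatched opponents:

```python
import math

def win_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(A beats B) when each player's performance is Normal(mu, sigma)
    and the higher performance wins: Phi((mu_a - mu_b) / sqrt(s_a^2 + s_b^2))."""
    diff = mu_a - mu_b
    spread = math.sqrt(sigma_a**2 + sigma_b**2)
    return 0.5 * (1.0 + math.erf(diff / (spread * math.sqrt(2.0))))

solid_sigma, experimental_sigma = 100.0, 250.0  # hypothetical values

# Against a stronger opponent (same opponent sigma), the high-variance
# "experimental" player wins more often than the "solid" one...
print(win_prob(1500, experimental_sigma, 1700, 100))
print(win_prob(1500, solid_sigma, 1700, 100))

# ...while against a weaker opponent, the "solid" player wins more often.
print(win_prob(1500, solid_sigma, 1300, 100))
print(win_prob(1500, experimental_sigma, 1300, 100))
```

Both of the experimental player’s probabilities stay on the correct side of 50%, but they’re pulled toward it, exactly the effect described above.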
Hmmm. Your reasoning is too abstract for me right now.
One thing that does come to mind, at the risk of doing the obnoxious physicist thing, is that performance variance might just be considered another parameter axis in the feature/latent space, and thus what you’re referring to might simply be modeled as a specific form of intransitivity – and if it’s possible to prove that intransitivity doesn’t affect the “transitivity of the win% over percentile rating strata”, then it would follow from these two things that performance variance also doesn’t?
But even then, I’d agree that assuming one constant performance variance across all players is not a justifiable assumption per se.
Mkay, so I’ll describe why I feel this “performance variance” matters to the currently existing model and updating method. For players whose performance varies more around their “true rating” (be it from being “experimental”, or from “form” – assuming form doesn’t correlate nearby games’ performance other than as a function of a genuine change in true rating, which would require a new layer in the model – or some other factor), each game should provide less indication of strength or weakness, and should therefore move the “estimated rating” around less than it would for less performance-variant players, because upsets are more likely for them. And if they’re instead assumed to have the same performance variance as everyone else, their rating will be much more “subjectively volatile” within the rating system than other players’ ratings.
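As a toy illustration of that updating idea (this is a plain Elo-style update with a hypothetical variance-dependent K-factor; it is not Glicko and not the OGS implementation), a game could simply move a high-variance player’s rating less:

```python
def toy_update(rating, opponent_rating, won, perf_sigma=100.0, base_k=32.0):
    """Elo-style rating update where the K-factor shrinks for players
    with higher estimated performance variance, so each individual game
    provides less evidence and moves their rating less.
    The 100/perf_sigma scaling is an arbitrary illustrative choice."""
    expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))
    k = base_k * (100.0 / perf_sigma)
    return rating + k * ((1.0 if won else 0.0) - expected)

# Same game outcome, but the high-variance player's rating moves less:
print(toy_update(1500.0, 1500.0, True, perf_sigma=100.0))
print(toy_update(1500.0, 1500.0, True, perf_sigma=250.0))
```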
A model that is extremely accurate in terms of raw percentage of game outcomes it predicts correctly will probably be very volatile. This is more relevant to a model like WHR whose main goal is to chase predictive accuracy, and may not matter as much for us if we’re just minorly tweaking the current system.
The metric used to evaluate “predictive performance” is very important
It’s also worth considering how and when any specific system makes changes to a player’s rank. The more often (in the extreme case, modifying everyone’s ranks after a new game outcome is encountered), the more the above points matter
A number of sources (skill variability, good and bad days, player pools interacting) may introduce volatility that’s hard to avoid
Strongly agreed. Maybe that helps add some context to the earlier statement “the key to judging a good model (if the goal is to judge one by predictive accuracy) might be how well it stratifies players whose observed win rates are significantly different than 50%.” Maintaining 50% win rates within ranks is of course important, but in order to compare one already good system to a modified version, it’s worth checking if/how these other rates change and whether it’s possible to maintain anything relatively consistent. It could also be a useless metric subject to a ton of noise, but we’ll find out. It also might be that maintaining ~50% win rates for handicap games where the rank difference equals the handicap is a good enough proxy for these measurements.
On that note, I’ve started generating some other stats from the current ranking system (the exact numbers give a good idea of some trends, but are not final as I only evaluated a small subset of the dataset):
How often the stronger player wins (in games with no handicap)^
To summarize, I’m just interested in seeing how these “pseudometrics” (thanks @meili_yinhua) change when evaluated on the same data for a given proposed modification to the ranking system. If there are any others you want to try out let me know.
I think part of that feeling is just the nature of looking at other people’s code in general, especially research-oriented code. I’m happy to try and provide clarifications (and I imagine anoek wouldn’t mind either – as the original author he might give more insight about why things were done the way they were).