Clarification and summary of past points relating to predictive accuracy and volatility, and why we might want to shy away from models like WHR
I think it’s worth summarizing once again since there’s been a lot of back and forth, though I hope that the intervening discussion has been at least partially productive rather than just frustrating. Please let me know if there are still any points of contention, mistaken assumptions, or outright incorrectness in the following:
An imaginary, black-box rating system that’s exceptionally (not just decently, but exceptionally) good at predicting outcomes on a given set of games will probably be uselessly volatile.
Part of this volatility may come from overfitting, which may be ameliorated in a number of ways.
Another part of this volatility may come from the fact that it’s just hard to accurately describe the chaos of human games and uneven interactions simply by comparing a single, relatively stable number per player. However, it depends on how you define “accurately”, with the more concerning definition being “percentage of games in which the higher-ranked player won”, or metrics derived from similar concepts. Many of @esporajam’s and others’ ideas for measuring accuracy are not of this type, but it’s worth noting, especially given that the WHR paper did actually use the former type of metric.
This does not mean that a model which favors relatively stable rank can’t be relatively good at making predictions (however you quantify “good”), just that trying to push predictive accuracy too far may be counterproductive. It also doesn’t mean that a highly-accurate predictive model actually exists either. I just suspect that if it did exist, it would be highly volatile in the general case.
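To make the kind of metric I’m worried about concrete, here’s a minimal sketch (Python, with invented example numbers) of the “did the higher-rated player win” style of evaluation; nothing in it is specific to OGS or to any particular rating system:

```python
# Minimal sketch of the "higher-rated player should win" style of metric.
# `games` is a hypothetical list of (winner_rating, loser_rating) pairs,
# where the ratings are whatever the system assigned *before* the game.

def higher_rated_win_fraction(games):
    """Fraction of games in which the pre-game higher-rated player won."""
    decided = [(w, l) for (w, l) in games if w != l]  # ignore exact rating ties
    if not decided:
        return float("nan")
    correct = sum(1 for (w, l) in decided if w > l)
    return correct / len(decided)

# Example: a system could score well here while still being very volatile,
# as long as its *ordering* of the two players is right just before each game.
example = [(1850, 1720), (1400, 1460), (2100, 2090)]
print(higher_rated_win_fraction(example))  # 2 of 3 games "predicted" correctly
```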
Another key dichotomy is between systems that are explicitly built to model observed win outcomes (like WHR) and systems that are built on concepts related to game outcomes/wins but are not solely fit to the observed outcomes themselves (like Glicko). The latter type may also do a decent job of describing win outcomes, but the former is again what concerns me more in terms of introducing unwanted volatility. That said, I realize that my assumption that the former kind of model was even under consideration at all may be completely wrong! (Sorry for so many formers and latters…)
Models also differ in whether they are allowed to update all ranks whenever new data is available, or only the ranks of the players involved in the new game records. The former case is largely what I’m concerned about, again because this kind of freedom is likely to produce volatility if the metric for “success” is a simplistic calculation of predictive success. There is also the question of how often rank updates should occur.
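As a caricature of that dichotomy, here’s a hedged sketch contrasting an Elo-style incremental update (only the two players in the new game move) with a naive full re-fit over the entire game history (every rating is free to move whenever it is re-run). The constants and the re-fit-by-repeated-sweeps loop are purely illustrative, not a description of any system under discussion:

```python
def expected_score(r_a, r_b):
    """Elo-style expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def incremental_update(ratings, game, k=32):
    """Only the two players in `game` are touched."""
    a, b, a_won = game  # (player_a, player_b, did_a_win)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta

def batch_refit(players, all_games, sweeps=50, k=8):
    """Naive full re-fit: every rating is free to move each time this is re-run."""
    ratings = {p: 1500.0 for p in players}
    for _ in range(sweeps):
        for game in all_games:
            incremental_update(ratings, game, k=k)
    return ratings
```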
All of these implementation details may result in widely varying behaviors, and overall I just want to note that if the wrong metric is used to evaluate them, a model that looks really good on paper (in a predictive sense) may not be the ranking system we actually want for OGS.
A specific example of some of these concepts, as they relate to the paper on WHR (and a question for @meili_yinhua)
The model in the WHR paper is of the type that my concerns are most relevant to (a rough sketch of the model, as I understand it, follows the list below):
- It is explicitly designed to explain observed game outcomes in that it provides a maximum a posteriori estimate of the rank of each player, given a set of games
- In its pure form (ignoring the paper’s proposal of less accurate but more incremental updates), it calculates this estimate using all games across all players
- Its metric for success is basically just the percentage of game outcomes correctly predicted by the assigned ranks
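For reference, here is the rough shape of the model as I understand the paper; the notation is mine and only approximate:

```latex
% Rough shape of WHR as I understand the paper (my notation; amsmath assumed):
\begin{align*}
  % Bradley--Terry win probability for a game between players i and j at time t,
  % in terms of each player's rating history r_i(t):
  P(i \text{ beats } j \text{ at } t) &= \frac{e^{r_i(t)}}{e^{r_i(t)} + e^{r_j(t)}} \\
  % Wiener-process prior on how each rating is allowed to drift over time:
  r_i(t_2) - r_i(t_1) &\sim \mathcal{N}\big(0,\; w^2 \, |t_2 - t_1|\big) \\
  % MAP estimate: the whole set of rating histories maximizing the posterior,
  % where G is the full collection of observed game results:
  \hat{r} &= \arg\max_{r} \; p(G \mid r)\, p(r)
\end{align*}
```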
Of particular note in this model, it’s possible that the MAP rank estimates at a given time, while being the best in a predictive sense, may overlook or misclassify certain players in the quest for global optimality. A future MAP estimate recalculated on new games (which may involve those players more often) may suddenly focus more on those players, altering their ranks more than one might expect. The authors do acknowledge this possibility, and specifically construct their prior distribution “to avoid huge jumps in ratings”, but we would have to test what the MAP estimation process does to a specific player’s rank over time, especially on a chaotic server like OGS with new players, long correspondence games, large breaks in playing, etc.

@meili_yinhua may be able to give a better idea of the extent of this problem and whether there exists any kind of upper bound on rank changes during the estimation process, as I’m still not fully familiar with the WHR method. If I’m interpreting it correctly, there is an upper bound on this type of volatility, and it is in fact controllable through the w parameter of the Wiener process used as a prior. However, the question remains whether a w low enough to match our idea of “acceptable” volatility still lets WHR provide any significant gain in predictive accuracy over other types of models.
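To make that last question slightly more concrete: as I read the prior, the standard deviation of a player’s rating change over a gap in play grows like w times the square root of the elapsed time, so w directly controls how much drift the prior treats as plausible. A tiny sketch, with completely invented numbers and units:

```python
import math

def prior_drift_sd(w, elapsed_time):
    """Prior std. dev. of a rating change over `elapsed_time`,
    under a Wiener process with variance w**2 per unit time
    (my reading of the WHR prior; units are whatever we pick for w)."""
    return w * math.sqrt(elapsed_time)

# Purely illustrative: if w were 60 rating points per sqrt(year),
# the prior would "expect" roughly this much drift over various gaps in play:
for years in (0.1, 0.5, 1.0, 2.0):
    print(years, round(prior_drift_sd(60, years), 1))
```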
Some potentially inescapable volatility
This again goes back to the idea of “pools” of players that mostly play among themselves. It is possible that a system will assign stable, predictive intra-pool ranks. However, inter-pool interactions may then cause inescapable volatility as the relative ranks of players in both pools are adjusted.
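As a toy illustration with plain Elo mechanics and made-up numbers: if a small pool’s ratings have collectively drifted, say, 200 points away from where the rest of the server would place those players, the first few cross-pool games produce corrections far larger than the players’ recent intra-pool history would suggest, while their pool-mates don’t move at all until they also venture outside the pool:

```python
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_won, k=32):
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A player from an isolated pool, carrying a ~200-point inflated rating,
# meets an established opponent of comparable true strength and loses a few games:
pool_player, established = 1900.0, 1700.0
for _ in range(3):
    pool_player, established = elo_update(pool_player, established, a_won=False)
    print(round(pool_player), round(established))
# The pool player's rating drops quickly; their pool-mates' ratings don't move at all
# until they, too, play outside the pool -- the volatility shows up at the boundary.
```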
Anecdotally, I’ve noticed on occasion that I’ll play a new (to me) player in a tournament and find that they “feel” a bit stronger or weaker than I’d expect, even though their rank uncertainty is relatively low. After the tournament concludes, I’ll see that player’s rank stabilize at a value closer to what I might expect. This is again totally anecdotal, but it does provide potential evidence for the idea of apparent volatility stemming from different pools of players suddenly interacting and needing to stabilize their relative ranks.
The WHR paper also considers this:
“For instance, if two players, A and B, enter the rating system at the same time and play many games against each other, and none against established opponents, then their relative strength will be correctly estimated, but not their strength with respect to the other players. If player A then plays against established opponents, and its rating changes, then the rating of player B should change too.”
Again, however, I believe their solution may not be what we want. In a model more similar to the current one, it might be worthwhile to consider not only rank uncertainty, but also rank uncertainty with respect to the specific players involved in a given pairing.
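I don’t have a concrete proposal for what that would look like, but as a purely hypothetical sketch: one crude proxy for pairing-specific uncertainty is how well-connected two players are in the graph of who has played whom, e.g. the length of the shortest chain of past games linking them. The function below is entirely my own invention for illustration, not anything from the WHR paper or the current system:

```python
from collections import deque

def pairing_separation(game_graph, a, b):
    """Length of the shortest chain of past games linking players a and b
    (1 = they have played each other, 2 = one common opponent, ...).
    Returns None if they live in disconnected pools.
    `game_graph` maps each player to the set of opponents they have played."""
    if a == b:
        return 0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        player, dist = frontier.popleft()
        for opponent in game_graph.get(player, ()):
            if opponent == b:
                return dist + 1
            if opponent not in seen:
                seen.add(opponent)
                frontier.append((opponent, dist + 1))
    return None  # no connection at all: maximal pairing uncertainty

# A larger separation (or None) could be treated as higher pairing-specific
# uncertainty, even when both players' individual rating deviations look small.
```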
I will try to look at @esporajam’s post and all that came after in more detail too! I just felt it worthwhile to try and summarize and move forward with past discussion topics first.