Testing the Volatility: Summary

The reason I hesitate is mostly the terminology and the specific problem domain. When we talk about a signal-to-noise ratio, the signal in question is usually very well defined, with nice properties we either know or can try to estimate (frequency, phase, etc.). The noise itself is usually pretty well quantified too (e.g. white noise). It’s also generally clear how the signal combines with the noise (additively, multiplicatively, etc.).

In this case, it’s not immediately clear to what extent a “true” signal exists, how it might be described, or how we might quantify the complex interactions producing the “noisy” data we’re seeing. Statistical concepts (like those that have been discussed: variance, random variables, confidence intervals, etc.) are probably more appropriate. That’s not to say there isn’t overlap (there’s a ton between statistics and signal processing), just that the analogy to signals might create some unwanted expectations.

All that said, I realize this is nitpicky and fully agree that the concept of a more stable rank may be obscured by “noisy” deviations.

The types of metrics I’m referring to are those like the one in the WHR paper, based on the raw proportion of correctly predicted games. On a dataset where everyone somehow has the same “true” rank and wins almost exactly 50% of the time against everyone else, a model that puts everyone at the same rank and predicts a 50% win rate for every game will of course score about 0.5 under such a metric, as you said. A model that adjusts ranks frequently and somehow captures more than 50% of the game outcomes would score higher, even though the former model is what we actually want. I’m just saying that the other metrics, the ones without this property, are better, as I believe you’ve been saying all along.

(Edit: ahhh… another potential issue here – I’m talking about the potential to accidentally and unfairly punish correct 50% predictions. I realize my original statement was open to confusing interpretation, maybe the revised one below helps)
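To make this failure mode concrete, here is a minimal sketch in Python (an entirely made-up setup: 100 players, 1000 coin-flip games, and arbitrary 0.9/0.1 forecasts; not anyone’s actual evaluation code). It compares the honest always-50% model against one that rates players by their empirical win rates over the same games it is scored on:

```python
import random

random.seed(1)
N_PLAYERS, N_GAMES = 100, 1000   # made-up sizes, purely illustrative

# Each game is (player_a, player_b, a_won); every outcome is a fair coin
# flip, i.e. everyone really does have the same "true" rank.
games = [(a, b, random.random() < 0.5)
         for a, b in (random.sample(range(N_PLAYERS), 2)
                      for _ in range(N_GAMES))]

# The "overfit" model rates each player by their empirical win rate over
# the very same dataset it will be scored on.
wins, counts = [0] * N_PLAYERS, [0] * N_PLAYERS
for a, b, a_won in games:
    wins[a] += a_won
    wins[b] += not a_won
    counts[a] += 1
    counts[b] += 1
rating = [w / c if c else 0.5 for w, c in zip(wins, counts)]

def accuracy(predict):
    """Raw proportion of games 'called' correctly; a 50% forecast calls
    neither player, so give it half credit."""
    total = 0.0
    for a, b, a_won in games:
        p = predict(a, b)                  # model's P(a beats b)
        total += 0.5 if p == 0.5 else float((p > 0.5) == a_won)
    return total / len(games)

def honest(a, b):
    # The model we actually want here: everyone is equal, always say 50%.
    return 0.5

def overfit(a, b):
    # Chases noise: confidently backs whichever player got luckier so far
    # (0.9/0.1 are arbitrary stand-ins for "confident").
    if rating[a] == rating[b]:
        return 0.5
    return 0.9 if rating[a] > rating[b] else 0.1

print(f"honest  accuracy: {accuracy(honest):.3f}")   # exactly 0.500
print(f"overfit accuracy: {accuracy(overfit):.3f}")  # noticeably above 0.500
```

On held-out games the overfit model would drop back to ~50% accuracy, and under a proper scoring rule (log loss, Brier score) its overconfident forecasts would score strictly worse than the honest 50% ones, which I take to be the property of the “better” metrics mentioned above.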

This is probably obvious and has been brought up again and again, but I wrote my thoughts out once more because of these statements:

I wanted to bring up the extreme case described above as a way of showing that the key to judging a model (if the goal is to judge one by predictive accuracy) might be how well it stratifies players whose observed win rates are significantly different from 50%. I fully agree that a dataset where most matches are between players of similar skill could be problematic for this very reason, and I agree with @flovo that there may not always be enough data for a given player.
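As a rough sketch of what “significantly different from 50%” could mean in practice (my own toy numbers, using a plain two-sided test with a normal approximation; nothing from the thread):

```python
from math import sqrt

def distinguishable_from_even(wins: int, games: int, z_crit: float = 1.96) -> bool:
    """True if the observed win rate differs from 50% at roughly 95% confidence."""
    if games == 0:
        return False
    p_hat = wins / games
    se = sqrt(0.25 / games)   # standard error of a proportion under p = 0.5
    return abs(p_hat - 0.5) / se > z_crit

print(distinguishable_from_even(7, 10))    # False: 70% over 10 games is still noise
print(distinguishable_from_even(70, 100))  # True: 70% over 100 games is signal
```

This also hints at why sparse data is a problem: with only a handful of games per player, almost nobody is distinguishable from 50%.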

We probably agree and are saying the same things in general; we just have different implicit assumptions or interpretations of each other’s posts (forums can be hard!).

Yes, this is something I’ve tried to be very clear about all along. Maybe this is clearer if I say:

This may be a very obvious statement, but it is worth stating in the context of designing a metric for evaluating a model. I also realize that, as you say:

> It’s perfectly possible that a system might happen to observe two players with a ~50% win rate, thinking they have the same “true” skill, when in reality there just isn’t enough data.

This is yet another reason that judging a ranking system by its predictive performance is potentially problematic, and why the evaluation metric needs to be chosen very carefully.
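To put rough numbers on “there just isn’t enough data” (toy figures of my own choosing): say player A’s true chance of beating player B is 65%. A short head-to-head series will still look roughly even a surprising fraction of the time:

```python
from math import comb

def p_looks_even(n: int, p_true: float = 0.65) -> float:
    """P(observed win rate lands in [0.4, 0.6]) when A's true win chance is p_true."""
    lo, hi = -(-2 * n // 5), 3 * n // 5   # ceil(0.4n) and floor(0.6n), exactly
    return sum(comb(n, k) * p_true**k * (1 - p_true)**(n - k)
               for k in range(lo, hi + 1))

for n in (10, 50, 200):
    print(f"{n:3d} games: {p_looks_even(n):.0%} chance the record looks even")
# roughly 46% at 10 games, 28% at 50, 8% at 200 (short series say very little)
```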

Throughout this whole topic, I have been working from the following assumptions:

  1. The goal is to reduce the perceived volatility of rankings
  2. Systems will be judged (at least in part) by their predictive ability, according to some metric

I am trying to point out the various issues I can foresee with this approach, including how we define such a metric and the potential pitfalls of each type of metric we might choose. In that light:

Will this necessarily happen? Not at all, but it is worth bringing up, not so much in an effort to solve it as to further discuss the end goals and the compromises/considerations that may come up along the way.

I apologize if any of this sounds terse; I just understand that forums can be a difficult way of communicating and want to make things as clear as possible. Often we all agree, but the conversation devolves because it is hard to keep track of everyone’s past statements.
