Testing the Volatility: Summary

I think we need to discuss the consequences of this for testing purposes.

So far, I’ve been assuming that the obvious way to perform such a test is something like the “virtual timeline” idea I came up with: simply redoing the calculations that “would have happened” if the rating system had been in use in real life and the same sequence of matches had been played.
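A rough sketch of what I mean, in Python; the `RatingSystem` object and its `predict`/`update` methods are just placeholder names I’m making up here, not the actual implementation:

```python
# Hypothetical sketch only: rating_system, predict() and update() are
# placeholder names, not the real implementation.
def replay_matches(rating_system, matches):
    """Replay the recorded match history through a candidate rating system.

    matches: iterable of (player_a, player_b, result), where result is 1
    if player_a won and 0 if player_b won.
    Returns the (prediction, result) pairs the system would have produced
    if it had been running while those matches were played.
    """
    predictions = []
    for player_a, player_b, result in matches:
        # The prediction this system would have made just before the game
        p_a_wins = rating_system.predict(player_a, player_b)
        predictions.append((p_a_wins, result))
        # Then feed it the actual outcome, exactly as in real use
        rating_system.update(player_a, player_b, result)
    return predictions
```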

(Here's the original full explanation of the idea, just for reference, but you don't need to read it)

So, because of the way matches are chosen, most of them are between players with similar ratings (according to what I’m going to call the “real rating system”), which means most of the predictions made by the real rating system are close to 50%; in the “virtual rating system”, that’s not necessarily the case.
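To give a feel for what “close to 50%” means numerically, here’s a tiny illustration that assumes, purely for the sake of example, an Elo-style expected-score formula with the usual 400-point scale (the actual rating system may use a different curve):

```python
def elo_expected_score(rating_a, rating_b):
    """Win probability for player A under an Elo-style logistic curve with
    a 400-point scale (an illustrative assumption, not necessarily the
    formula used by the system being tested)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A typical matchmade game: similar ratings, prediction near 50%
print(elo_expected_score(1520, 1500))  # ~0.53
# The same game can look lopsided to a system that rates the players differently
print(elo_expected_score(1800, 1500))  # ~0.85
```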

In my original proposal I anticipated this point and made the following considerations:

So, we’ve now discussed the idea that most evaluation metrics, if not all, seem to punish wrong predictions that are close to 50% less harshly than wrong predictions that are far from 50%.
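For instance, taking binomial deviance as the per-game metric (the natural-log version, i.e. log loss), a wrong 55% prediction costs far less than a wrong 90% prediction; the numbers below are purely illustrative:

```python
import math

def binomial_deviance(p_predicted, result):
    """Per-game binomial deviance (natural-log version, i.e. log loss)
    for a predicted win probability and the actual result (1 or 0)."""
    return -(result * math.log(p_predicted)
             + (1 - result) * math.log(1.0 - p_predicted))

# Both predictions are wrong (the favourite lost), but the one close to
# 50% is punished far less:
print(binomial_deviance(0.55, 0))  # ~0.80
print(binomial_deviance(0.90, 0))  # ~2.30
```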

I’m trying to think about this, but I’m unable to reach a conclusion.

Do you think this gives an edge to the real rating system, because its wrong predictions, being closer to 50%, are punished less? Or does it give an edge to the virtual rating system, because “it’s easier to correctly predict the result if the players have very different ratings”?

Or does it not matter, and whichever system gets the “edge” is simply the one making better predictions?

There’s a little intuitive voice in my head saying that the answer actually depends on whether you’re using a metric that evaluates games individually, like binomial deviance, or a metric based on the percentage of correct predictions, like the one used by WHR or the one I had come up with:

But I have no idea how to reason thoroughly about this.
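In case it helps make that distinction concrete, here’s a small sketch contrasting the two kinds of metric on the same set of made-up predictions: a per-game metric (binomial deviance again) versus a plain percentage-of-correct-predictions metric:

```python
import math

def mean_binomial_deviance(predictions):
    """Per-game metric: average log loss over (prediction, result) pairs."""
    total = sum(-(r * math.log(p) + (1 - r) * math.log(1.0 - p))
                for p, r in predictions)
    return total / len(predictions)

def percent_correct(predictions):
    """Aggregate metric: fraction of games in which the predicted
    favourite (probability above 50%) actually won."""
    correct = sum(1 for p, r in predictions if (p > 0.5) == (r == 1))
    return correct / len(predictions)

# Made-up predictions only: each list gets 2 out of 3 games right.
near_fifty     = [(0.55, 1), (0.55, 1), (0.55, 0)]  # real-system-like predictions
far_from_fifty = [(0.90, 1), (0.90, 1), (0.90, 0)]  # virtual-system-like predictions

print(percent_correct(near_fifty), percent_correct(far_from_fifty))                # 0.67 vs 0.67
print(mean_binomial_deviance(near_fifty), mean_binomial_deviance(far_from_fifty))  # ~0.66 vs ~0.84
```

With these invented numbers the two lists look identical to the percentage-correct metric but not to the per-game one, which is roughly why I suspect the choice of metric matters here.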
