Well, more than anything else, if it turns out that simply stabilizing the rating gives better predictions on past data than the current algorithm, then to me that sounds like extremely strong evidence that the volatility is not helping the accuracy at all.
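To make "better predictions on past data" concrete, here is a minimal sketch of one way to score it. I'm assuming each system outputs a win probability before every game and that we compare them by mean log loss (lower is better). The function name and the numbers are made up for illustration; they are not real game data or anything from the actual rating code.

```python
import math

def log_loss(predictions):
    """Mean negative log-likelihood over (predicted_win_prob, result) pairs.

    result is 1 if the player the probability refers to won, else 0.
    Lower values mean the predictions matched the outcomes better.
    """
    eps = 1e-12  # guard against log(0) for overconfident predictions
    total = 0.0
    for p, won in predictions:
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if won else -math.log(1 - p)
    return total / len(predictions)

# Hypothetical predictions from two rating variants on the same four games.
current = [(0.70, 1), (0.60, 0), (0.55, 1), (0.80, 1)]
stabilized = [(0.65, 1), (0.50, 0), (0.60, 1), (0.75, 1)]

print("current:", log_loss(current))
print("stabilized:", log_loss(stabilized))
```

Whichever variant gets the lower score on the same set of real games is the one predicting better, at least by this particular measure.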
It doesn’t sound to me at all like that would be overfitting the past data.
But you’re right that this is a point I hadn’t considered enough in the past. For example, if you started exploring the parameter space to tweak the rating system and measure its performance on past data, I can see how that might lead to overfitting. Intuitively, it might be related to this.
I wonder to what extent that could be solved the way it’s sometimes done in machine learning: you fit the parameters on one subset of the data, and you test them on a different subset that played no part in the fitting.
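Here is a rough sketch of what that train/test idea could look like. To keep it short I'm using a plain Elo update with a step size `k` as a stand-in for "how jumpy the rating is"; that is my assumption for illustration, not the site's actual algorithm. The games are synthetic, and the names `run`, `expected` and `candidates` are hypothetical.

```python
import math
import random

def expected(ra, rb):
    """Elo-style probability that the player rated ra beats the player rated rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def run(games, k, score_from=0):
    """Replay games in order, updating ratings with step size k.

    Each prediction is made before the game's result is applied.
    Returns the mean log loss over games[score_from:].
    """
    ratings = {}
    loss, n = 0.0, 0
    for i, (a, b, a_won) in enumerate(games):
        ra, rb = ratings.get(a, 1500.0), ratings.get(b, 1500.0)
        p = expected(ra, rb)
        if i >= score_from:
            loss += -math.log(p if a_won else 1 - p)
            n += 1
        delta = k * ((1.0 if a_won else 0.0) - p)
        ratings[a], ratings[b] = ra + delta, rb - delta
    return loss / max(n, 1)

# Synthetic history: 20 players with fixed hidden strengths (illustration only).
random.seed(0)
strength = {p: random.gauss(1500, 200) for p in range(20)}
games = []
for _ in range(2000):
    a, b = random.sample(range(20), 2)
    games.append((a, b, random.random() < expected(strength[a], strength[b])))

# Chronological split: tune on the first 80%, evaluate on the last 20%.
split = int(len(games) * 0.8)
candidates = [4, 8, 16, 32, 64]
best_k = min(candidates, key=lambda k: run(games[:split], k))
test_loss = run(games, best_k, score_from=split)

print("best k on training games:", best_k)
print("log loss on held-out games:", test_loss)
```

I split by time rather than at random because ratings depend on game order, so the held-out games should come after the ones used for tuning. If a setting only looks good on the training games and not on the held-out ones, that is the overfitting showing up.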
Still, I’d like to see some evidence. People can appeal to whatever big theories they want, but until they perform an experiment that could falsify their predictions, those theories are mostly empty words.
My words – my belief that the current level of volatility is not useful – are empty too, and I’d like to perform a test able to falsify them.
I would of course also be happy to just see (convincing) evidence gathered by someone else. “Hey, we did perform a test like what you were talking about, here’s the results” – my impression is that the closest thing to this that has been done so far is the “lumping” winrate test, and discussing how that isn’t necessarily significant is how this thread has come about.