Is rating volatility a bug or a feature? [Forked]

Oh, ok. Then I guess you need to either divide by the number of games (an arithmetic mean, or some other kind of average), or otherwise adjust the expectation based on the number of games.

Intuitively the sum of those points should be greater than half the number of games if the rating system gets most predictions right, and less if it gets most wrong. If it’s close, uh… I have no idea. I’m starting to think we need a better test :laughing: or maybe just someone who knows more about statistics than me!

EDIT: maybe I’ve got something. If you group those predictions into buckets of comparable probability (e.g. all the ones where 0.55 <= p < 0.56), you can see in what percentage of the games in each bucket the rating system guesses right, and if it’s good, you’d expect that percentage to be roughly equal to the predicted probability.
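
Here's a rough Python sketch of that bucketing idea, in case it helps. Everything in it is my own assumption: the function name `calibration_table`, the `(p, won)` pair format (predicted win probability for one player, and whether that player actually won), and the default of 100 equal-width buckets to match the 0.55 <= p < 0.56 example. The toy data is made up and only shows the shape of the output.

```python
from collections import defaultdict

def calibration_table(predictions, n_buckets=100):
    """Group (p, won) pairs into equal-width probability buckets.

    predictions: iterable of (p, won), where p is the predicted win
    probability for one player and won is True if that player won.
    Returns a list of (low, high, games, observed_win_rate) rows.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [games, wins]
    for p, won in predictions:
        # round() guards against float noise pushing a boundary value
        # into the bucket below; min() keeps p == 1.0 in the last bucket.
        idx = min(int(round(p * n_buckets, 9)), n_buckets - 1)
        counts[idx][0] += 1
        counts[idx][1] += int(won)
    return [
        (idx / n_buckets, (idx + 1) / n_buckets, games, wins / games)
        for idx, (games, wins) in sorted(counts.items())
    ]

# Made-up toy data, not real game results.
toy = [(0.55, True), (0.553, False), (0.558, True), (0.559, True), (0.72, False)]
for low, high, games, rate in calibration_table(toy):
    print(f"{low:.2f} <= p < {high:.2f}: {games} games, won {rate:.0%}")
```

With real data you'd compare each bucket's observed rate against its probability range. Buckets with only a handful of games will be noisy, so wider buckets (a smaller `n_buckets`) may be needed unless there are a lot of games.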