Most systems I know of use the direct % of games in which the rating predicted the winner (1 if the stronger player wins, 0 if the weaker player wins, then take the average), but I have also seen Root Mean Square used as a measure. Either one should work. (Personally, I prefer the RMS system.)

Getting more varied data sets would also be reasonable, especially since we’re talking an entire server of players here.

I used the opponent ratings as provided by OGS (no recalculation of their rating history).
I only recalculated my own rating history with an OGS-like Glicko2 (again with the opponent ratings provided by OGS).

rms is calculated over (outcome - win_probability): rms = sqrt(Σ(outcome - win_probability)² / number_of_games)
direct % = 1 - Σ(outcome XOR (player_rating > opponent_rating)) / number_of_games
(outcome = 1 if the player wins, and outcome = 0 if the opponent wins)
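For anyone who wants to try this on their own history, here is a minimal sketch of the two measures as described above. The function names and the sample numbers are made up for illustration; how win_probability is derived from the ratings is not specified in this thread, so it is simply taken as an input here.

```python
import math

def rms_error(outcomes, win_probabilities):
    """Root mean square of (outcome - win_probability); lower is better."""
    squared = [(o - p) ** 2 for o, p in zip(outcomes, win_probabilities)]
    return math.sqrt(sum(squared) / len(outcomes))

def direct_pct(outcomes, player_ratings, opponent_ratings):
    """Fraction of games where the higher-rated side won; bigger is better."""
    misses = 0
    for o, pr, opp in zip(outcomes, player_ratings, opponent_ratings):
        predicted_win = pr > opp           # rating says the player should win
        if bool(o) != predicted_win:       # XOR: prediction and outcome disagree
            misses += 1
    return 1 - misses / len(outcomes)

# Hypothetical data: 4 games, outcome = 1 if the player won.
outcomes = [1, 0, 1, 1]
win_probabilities = [0.7, 0.4, 0.6, 0.2]
player_ratings = [1500, 1500, 1500, 1500]
opponent_ratings = [1400, 1550, 1450, 1700]

print(rms_error(outcomes, win_probabilities))
print(direct_pct(outcomes, player_ratings, opponent_ratings))
```

Note that this sketch does not adjust ratings for handicap games; that would have to happen before the ratings are passed in.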

Prediction quality of my rating history: (rms: lower is better, direct %: bigger is better)
Player id: 449941
Games: 978

| algorithm | rms | direct % |
|---|---|---|
| OGS | 0.4645 | 64.93% |
| Glicko2 | 0.4637 | 65.95% |

There is no difference in predictability.

To get the right values for Glicko2, I would have to recalculate the histories for the whole player base. At the moment I’m not able to do that (rating history gets cropped to max 5000 games).

EDIT: Forgot to adjust ratings for handicap games in the calculation of the direct %. Now both are slightly higher.

Idk about that, 1% increased accuracy seems to be a lot as far as rating systems are concerned…

But yeah, a more in-depth study would be required before making any conclusions.

Turn to page 7.

He is right about it not being significant, though.

From a chi-square goodness of fit test (with OGS predictions being the expected values), the p value is 0.642182, which is very far from significant. Needs more testing before we can reach any conclusions.
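A rough sketch of what such a goodness-of-fit test could look like, using only the standard library. This is an assumption about the setup, not the exact calculation behind the p value quoted above: the hit counts are reconstructed from the posted percentages and 978 games, and the categories are assumed to be "predicted correctly" and "predicted wrongly", so the result need not match that number.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_one_df(chi2):
    """Upper-tail p value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

games = 978
# Assumption: counts reconstructed from the posted direct % values.
ogs_hits = round(0.6493 * games)
glicko2_hits = round(0.6595 * games)

observed = [glicko2_hits, games - glicko2_hits]   # Glicko2: correct, wrong
expected = [ogs_hits, games - ogs_hits]           # OGS as the expected values

chi2 = chi_square_statistic(observed, expected)
p = p_value_one_df(chi2)
print(f"chi2 = {chi2:.4f}, p = {p:.4f}")
```

Two categories give one degree of freedom, which is why the p value can be written with `math.erfc` instead of pulling in a statistics library.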

p-values are close to useless, it’s more useful to calculate the effect size. :stuck_out_tongue:

P-values are necessary for any type of statistical research unless you want to fall into the gambler’s fallacy. Calculating past effect size is not quite the same as preparing for future data which can and will differ from past data.

This is why we need more research/data before we draw any conclusions on the matter, since the 1% increase is suggestive, but not significant enough to be taken as fact.

p-values offer the probability of observing a given set of data (assuming…), effect sizes allow us to compare results of several studies. I wrote a whole chapter on this, hence my bias towards effect sizes. :slight_smile:

You could have a tiny effect even if you had a spectacularly low p, simply due to having a high N (or many, in the worst case even unadjusted-for comparisons… hello fMRI data).

I’m not sure how you can get a good measure of significance from that alone, though…

I’m not sure how to get a reliable measure of value either. Both (OGS and Glicko2 predictions) share the same opponent_ratings. Therefore both values share, to some extent, the reliability of the OGS ratings.

I’m not sure how much one should trust even statistically significant results.

Well, it is pretty much guaranteed that we want to continue the investigation, but the question of how we should measure the success of our experiment is of undoubted importance.

Significance is about (I’m paraphrasing for dramatic effect, this is not the definition) “We’re this confident what we saw isn’t just due to us having sampled just the right data by accident.” Effect size tells you “Maybe it was by accident, but the difference between x and y is this big”.
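To put a number on "this big", one common effect size for a difference between two proportions is Cohen's h. The sketch below applies it to the two direct % values posted earlier; picking Cohen's h is my own choice for illustration, not something proposed in this thread, and it treats the two proportions as independent even though both come from the same games.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Direct % values from the table above, as proportions.
glicko2_accuracy = 0.6595
ogs_accuracy = 0.6493

h = cohens_h(glicko2_accuracy, ogs_accuracy)
print(f"Cohen's h = {h:.4f}")
```

By Cohen's usual rule of thumb, h around 0.2 counts as a small effect, 0.5 as medium and 0.8 as large, so the value can be read against that scale regardless of how many games went into it.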

You could try recalculating Glicko for the player in question and his immediate opponents (if you have time, then for opponents of opponents and so on) and see whether it changes anything. It’d still be relying on OGS calculations, but to a lesser degree.

That would be a shorter test (than the whole OGS database), we could see if the increase in accuracy still holds, and see if there’s a higher significance in that, or see if we need to investigate even deeper.

