Rank Instability on OGS

Maybe I’m incredibly dense, but what did flovo do differently between the red line and the green line? Does it have higher or lower predictive power (since that’s the true point of ratings/rankings)?

This is actually a really serious issue with rating systems that people do attempt to address (see Whole History Rating), and it is part of what RD is supposed to address (the “this is where we think you are”, not the “this is how far we think you can move”).

Sure, and that makes sense, but keep in mind that most rating systems are, in a way, statistical theories (which is why they usually come with academic papers), and thus need some sort of data to back up their claims.

It could be that the OGS implementation is not that great (I’d like to have more details on flovo’s implementation), but your problem lies either with the implementation or with the theory, and the battle of theory is not an easy one to fight…


In short: green is Glicko2, red is OGS.

In more detail:
The green line is an OGS-like Glicko2 calculation. By OGS-like I mean that I use the current rating of the opponents, not their rating at the start of the rating period (base rating).
For the red line, I had to artificially raise the deviation of the players’ base ratings at the start of each rating period. I use the deviation provided by OGS’s termination-api for the first game of the new rating period, because I don’t know how these deviations get calculated by OGS (as far as I can tell, it’s an undocumented feature of the OGS rating system).
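For comparison, the only deviation increase the textbook Glicko-2 prescribes between rating periods comes from the volatility (φ* = √(φ² + σ²)). A small sketch of that standard step, which is not necessarily what OGS does:

```python
import math

GLICKO2_SCALE = 173.7178  # conversion factor between display scale and internal Glicko-2 scale

def inflate_deviation(rd, volatility=0.06, periods=1):
    """Textbook Glicko-2 step: phi* = sqrt(phi^2 + sigma^2), once per rating period.

    `rd` is on the display scale (e.g. 65); how OGS actually derives the
    inflated deviation for a new rating period is undocumented, so this is
    only the standard behaviour, not a reimplementation of OGS.
    """
    phi = rd / GLICKO2_SCALE
    for _ in range(periods):
        phi = math.sqrt(phi ** 2 + volatility ** 2)
    return phi * GLICKO2_SCALE

# Example: a deviation of 65 barely moves after one standard inflation step.
print(round(inflate_deviation(65), 1))  # ~65.8
```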

For both lines I use (see the sketch after this list):

  • rating period = 15 games or 30 days (30 days never applies in this special case)
  • initial player rating = 1500, deviation = 350, volatility = 0.06 (that’s what OGS uses)
  • τ = 0.6 (I could have chosen τ = 0.3 or τ = 1.2 as well; they cause no visible difference in my tests)
  • For the current rating of the opponents I pull their rating history and look up their rating 1 second before the “current” rating calculation.
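As a small sketch of those starting values and the conversion to the internal Glicko-2 scale (the 173.7178 constant comes from the Glicko-2 paper and is nothing OGS-specific):

```python
# Starting values as listed above (what OGS uses, according to this thread).
INITIAL_RATING = 1500.0
INITIAL_DEVIATION = 350.0
INITIAL_VOLATILITY = 0.06
TAU = 0.6                      # system constant; 0.3 and 1.2 made no visible difference
RATING_PERIOD_GAMES = 15
RATING_PERIOD_DAYS = 30        # never triggers in this particular history

GLICKO2_SCALE = 173.7178

def to_glicko2_scale(rating, deviation):
    """Convert a display-scale rating/RD pair to the internal (mu, phi) pair."""
    return (rating - 1500.0) / GLICKO2_SCALE, deviation / GLICKO2_SCALE

mu, phi = to_glicko2_scale(INITIAL_RATING, INITIAL_DEVIATION)  # (0.0, ~2.015)
```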

I want to note that after approximately 45 games, the deviation doesn’t get any lower. It’s always between ≈90 at the start of a rating period and ≈65 at the end of a rating period. This is common to almost all players on OGS who play multiple games per month, as you can easily verify here: OGS rank histogram (outdated) - #28 by DVbS78rkR7NVe

Glicko2 deviation drops below 34 at the end.


I see, that makes sense, and it might hint at an underlying problem. With 15 games (or 30 days) to a rating period, the rating periods of different players will often be staggered (which is, as far as I know, not good for the system), and so the green line (while not perfectly in line with Glicko-2) bases ratings more on current skill.

I always thought the system we had in place for rating periods was a bit wonky.

So I suppose the real question is: does it have a higher accuracy in predicting wins and losses (even if it’s just for you) than the OGS implementation?

I never tested it.

I’ll have to think about how to measure this reliably. (Do you think squared difference is a good measure? [(1 − prob)² for a win and (0 − prob)² for a loss, where prob is the predicted win probability])

I’ll also have to find some rating histories of players with reliable strength (equally good at all speeds/sizes, or only playing one) who don’t play many games against bots.
My own rating history is biased by the fact that my 9x9 rank is about 13k and my 19x19 rank around 17k. I just dropped 5 ranks because I switched to only 19x19 for 2 weeks.
But as I think about it, the rating system should cope with that. I’ll use my own history until I find a better one.

Will check it. (I can only check against opponent ratings provided by OGS)

Great point, great idea.

It’s interesting, actually, that the ostensible primary purpose of the rating system is to find even matches.

If that is the case, then ^^^ this is the primary question, and it will be fascinating if someone can figure out the answer.

But actually, I think there’s a softer purpose, possibly as important, which is communicating to us how “good” we are in some abstract sense. If that were not the case, we wouldn’t bother converting to Kyu/Dan.

It’s this second purpose that isn’t well served by a system that fluctuates wildly.

It might almost suggest that our K/D ranking should be “smoothed” relative to our Glicko rating at any given time…
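Purely as an illustration of what “smoothed” could look like (not something OGS actually does), an exponential moving average over the rating before converting it to kyu/dan:

```python
def smoothed_ratings(ratings, alpha=0.2):
    """Exponential moving average of a rating history.

    `alpha` is a made-up smoothing factor; smaller values react more slowly.
    The displayed kyu/dan rank would then be derived from the smoothed value
    instead of the raw Glicko rating.
    """
    current = ratings[0]
    smoothed = []
    for r in ratings:
        current = alpha * r + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

# A fluctuating history gets damped into a gentler curve.
print(smoothed_ratings([1500, 1450, 1520, 1380, 1490]))
```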

Most systems I know of use a direct % of games where the winner was predicted by rating (1 if the stronger player wins, 0 if the weaker one wins, and take the average), but I have also seen Root Mean Square used as a measure; either one should work. (Personally I prefer the RMS measure.)

Getting more varied data sets would also be reasonable, especially since we’re talking an entire server of players here.

I have some first estimates.
I used the opponent ratings as provided by OGS (no recalculation of their rating histories).
I only recalculated my own rating history with OGS-like Glicko2 (again with opponent ratings provided by OGS).

rms is calculated over (outcome − win_probability)
direct % is calculated as 1 − Σ(outcome XOR (player_rating > opponent_rating)) / number_of_games
(outcome = 1 if the player wins, outcome = 0 if the opponent wins)
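For reference, a minimal Python reading of those two measures (my own sketch; I don’t know which exact formula was used for win_probability, so the standard Glicko-2 expected score is plugged in, and handicap adjustment is left out):

```python
import math

GLICKO2_SCALE = 173.7178

def glicko_win_probability(rating, deviation, opp_rating, opp_deviation):
    """Standard Glicko-2 expected score E, computed from display-scale values.

    This uses only the opponent's deviation, as in the Glicko-2 paper; it is
    an assumption, not necessarily the formula used for the numbers below.
    """
    mu = (rating - 1500.0) / GLICKO2_SCALE
    mu_j = (opp_rating - 1500.0) / GLICKO2_SCALE
    phi_j = opp_deviation / GLICKO2_SCALE
    g = 1.0 / math.sqrt(1.0 + 3.0 * phi_j ** 2 / math.pi ** 2)
    return 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))

def rms_error(outcomes, win_probabilities):
    """Root mean square of (outcome - win_probability); lower is better."""
    n = len(outcomes)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(outcomes, win_probabilities)) / n)

def direct_percent(outcomes, player_ratings, opponent_ratings):
    """Share of games where the higher-rated side won; higher is better.

    outcome = 1 means the player won, 0 means the opponent won.
    Handicap adjustment is left out here.
    """
    correct = sum(
        1
        for outcome, r_p, r_o in zip(outcomes, player_ratings, opponent_ratings)
        if (r_p > r_o) == bool(outcome)
    )
    return correct / len(outcomes)
```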

Prediction quality of my rating history (rms: lower is better, direct %: higher is better):
Player id: 449941
Games: 978

algorithm   rms      direct %
OGS         0.4645   64.93%
Glicko2     0.4637   65.95%

There is practically no difference in predictive power.

To get the right values for Glicko2, I would have to recalculate the histories for the whole player base. At the moment I’m not able to do that (rating history gets cropped to max 5000 games).

EDIT: Forgot to adjust ratings for handicap games in the calculation of the direct %. Now both are slightly higher.


Idk about that, a 1% increase in accuracy seems like a lot as far as rating systems are concerned…

But yeah, a more in-depth study would be required before making any conclusions.

Turn to page 7.

He is right about it not being significant, though.

From a chi-square goodness of fit test (with OGS predictions being the expected values), the p value is 0.642182, which is very far from significant. Needs more testing before we can reach any conclusions.
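For anyone who wants to redo that kind of check, here is roughly how such a goodness-of-fit test can be set up with scipy; the counts below are placeholders reconstructed from the percentages above, not the actual data behind the quoted p-value.

```python
from scipy.stats import chisquare

n_games = 978

# Placeholder counts of correctly predicted games (hypothetical,
# back-of-the-envelope from ~65.95% and ~64.93% of 978 games).
glicko2_correct = 645
ogs_correct = 635

observed = [glicko2_correct, n_games - glicko2_correct]
expected = [ogs_correct, n_games - ogs_correct]   # OGS predictions as the expected values

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```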

p-values are close to useless; it’s more useful to calculate the effect size. :stuck_out_tongue:

P-values are necessary for any type of statistical research unless you want to fall into the gambler’s fallacy. Calculating past effect size is not quite the same as preparing for future data which can and will differ from past data.

This is why we need more research/data before we make any conclusions on the matter, since the 1% increase is suggestive, but not significant enough to be granted as fact.

p-values offer the probability of observing a given set of data (assuming…), while effect sizes allow us to compare results across several studies. I wrote a whole chapter on this, hence my bias towards effect sizes. :slight_smile:

You could have a tiny effect even with a spectacularly low p, simply due to having a high N (or many comparisons, in the worst case not even adjusted for… hello fMRI data).
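As one concrete example of an effect-size measure for two proportions (my pick, not necessarily what was meant above), Cohen’s h applied to the two accuracies from the table:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Prediction accuracies from the table above.
h = cohens_h(0.6595, 0.6493)
print(round(h, 3))  # ~0.02, far below the conventional "small" threshold of 0.2
```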

I’m not sure how you can get a good measure of significance from that alone, though…

I’m not sure how to get a reliable measure of value either. Both (the OGS and Glicko2 predictions) share the same opponent ratings. Therefore both values share, to some extent, the OGS reliability.

I’m not sure how much one should trust even statistical significant results.

Well, it is pretty much guaranteed that we want to continue the investigation, but the question of how we should measure the success of our experiment is of undoubted importance.

Significance is about (I’m paraphrasing for dramatic effect, this is not the definition) “We’re this confident that what we saw isn’t just due to us having sampled just the right data by accident.” Effect size tells you “Maybe it was by accident, but the difference between x and y is this big.”

You could try recalculating Glicko for the player in question and his immediate opponents (if you have time, then for opponents of opponents and so on) and see whether it changes anything. It’d still rely on OGS calculations, but to a lesser degree.
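Roughly, that expansion could look like this; fetch_opponents is a placeholder for whatever call returns a player’s opponents, not a real OGS endpoint.

```python
from collections import deque

def players_to_recalculate(player_id, fetch_opponents, depth=1):
    """Collect the player plus everyone up to `depth` opponent-hops away.

    `fetch_opponents(player_id)` is a placeholder that should return the ids
    of all players the given player has rated games against.
    """
    seen = {player_id}
    queue = deque([(player_id, 0)])
    while queue:
        current, d = queue.popleft()
        if d == depth:
            continue
        for opponent in fetch_opponents(current):
            if opponent not in seen:
                seen.add(opponent)
                queue.append((opponent, d + 1))
    return seen

# depth=1: the player and direct opponents; depth=2 adds opponents of opponents, etc.
```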


That would be a shorter test (than the whole OGS database); we could see if the increase in accuracy still holds and whether there’s higher significance in that, or whether we need to investigate even deeper.
