Rank Instability on OGS

A complaint against unstable ranks can go like this. When ratings become too unstable, they lose meaning. You can’t confidently say that you’re X rank because your rank fluctuates like crazy. You can’t really say that you “reached” a certain rank either. And even though unstable ranks are supposed to be more accurate, it becomes kinda hard to judge your opponents: a 1d could be a 4k on a lucky streak, so the only way is to examine their history (or play the position and not the opponent, but who does that :stuck_out_tongue:). There are people who like more conservative systems: yeah, sometimes one plays like 2k, sometimes like 6k, but it would be nice if the system brushed that off and showed 4k the whole time.

Say what you want, but from my perspective the green line in flovo’s graph appears to describe what’s going on much better than the OGS line.

And let’s remember that this picture hasn’t been disproved or confirmed yet. Maybe ranks are supposed to be less shaky in Glicko2 after all.

Or maybe it’s just the nature of go ranks. As we know, OGS ranks stand less than 100 points apart. That’s dense.


Could it be confirmation bias? Looking at my history, the rating goes up after a win by resignation against a weaker player. I would need your player id to verify.

I think you’d have to reassess your premise that the 10k>2k walk was

  • accomplished by an honest-to-god 10k
  • due to rating uncertainty

instead of being an extreme outlier that suggests something other than legitimate play is at work (my portmanteau in this case would be… sandbotting). To rephrase my words from somewhere above, instead of doubting Glicko2’s feature (high adaptability), why not doubt the result (“omg 10k can be 2k glicko is useless” > “oh look, this must be a very unique individual”).

Actually my words were something to the tune of ‘It may be annoying to get the impression that you’re making progress when really it’s just Glicko being really enthusiastic about your recent winning streak and showing results immediately - only to get frustrated again when the law of averages kicks in and you’re regressing to your mean performance, but it’s better to be able to see results quickly if you did make progress than having to wait 20 games just because the system is built to err on the safe side (hello IGS)’.

To reiterate, if my 12 years of experience with all sorts of Go players means anything, someone who plays evenly with other 10k does not overnight nor over the course of two months go head-to-head with 2k. Shenanigans.

From my own experience and the games I’ve watched here it is very well possible that someone deviates maybe 2 ranks, yes. But in a game between a drunk 3d on a losing streak and a sober 2k on a winning streak, if they meet at 1d, the odds are not 50/50. That’s why the 3d-1d will bounce back and the 2k-1d will drop again.


FWIW, thanks to @flovo’s script, I’ve been able to see that the OGS algorithm seems to be really self-adaptive against abnormal data.
We’ve done a check based just on timeout wins (where, for instance, I undeservedly won against some dan player), but it heartened me about the overall system.
As long as the outcome series doesn’t contain frequent and recurrent anomalies, rating should be able to correct itself fairly quickly.

Maybe I’m incredibly dense, but what did flovo do differently between the red line and the green line? Does it have higher or lower predictive power (since that’s the true point of ratings/rankings)?

This is actually a really serious issue with rating systems that people do attempt to address (see Whole History Ratings), and it is part of what RD is supposed to address (the “this is where we think you are”, not “this is how far we think you can move”).

Sure, and that makes sense, but keep in mind that most ratings systems are, in a way, statistical theories (which is why they always come in academic papers), and thus need some sort of data to back up their claims.

It could be possible that the OGS implementation is not that great (I’d like to have more details from flovo’s implementation), but your problem either lies with the implementation or the theory, and the battle of theory is not an easy one to fight…


In short: green is Glicko2, red is OGS.

In more detail:
The green line is an OGS-like Glicko2 calculation. By OGS-like I mean that I use the current rating of the opponents, not their rating at the start of the rating period (base rating).
For the red line, I had to artificially raise the deviation of the player’s base rating at the start of each rating period. I use the deviation provided by OGS’s termination API for the first game of the new rating period, because I don’t know how it is calculated by OGS (as far as I can tell, it’s an undocumented feature of the OGS rating system).

For both lines I use:

  • rating period = 15 games or 30 days (30 days never applies in this special case)
  • initial player rating = 1500, deviation = 350, volatility = 0.06 (that’s what OGS uses)
  • τ = 0.6 (I could have chosen τ = 0.3 or τ = 1.2 as well; they cause no visible difference in my tests)
  • For the current rating of the opponents I pull their rating history and look up their rating 1sec before the “current” rating calculation.
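For reference, the expected score that Glicko2 computes from these quantities can be sketched like this (a minimal illustration of the standard Glicko2 formulas, not flovo’s actual script; the function names are mine):

```python
import math

GLICKO2_SCALE = 173.7178  # standard conversion between the Glicko and Glicko2 scales

def g(phi):
    """Dampening factor: high opponent deviation shrinks the rating gap's influence."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)

def expected_score(r, r_opp, rd_opp):
    """Expected win probability of a player rated r against an opponent
    rated r_opp whose rating deviation is rd_opp (all on the Glicko scale)."""
    mu = (r - 1500.0) / GLICKO2_SCALE
    mu_opp = (r_opp - 1500.0) / GLICKO2_SCALE
    phi_opp = rd_opp / GLICKO2_SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_opp) * (mu - mu_opp)))

# Equal ratings give a 50% expectation regardless of the opponent's deviation:
print(expected_score(1500, 1500, 350))  # 0.5
```

This is also the quantity the prediction-quality discussion below depends on: the deviations (≈90/65 vs ≈34) change how sharply the system commits to a favorite.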

I want to note that after approx. 45 games, the deviation doesn’t get any lower. It’s always between ≈90 at the start of a rating period and ≈65 at the end of a rating period. This is common to almost all players on OGS who play multiple games per month, as you can easily verify here: OGS rank histogram

Glicko2 deviation drops below 34 at the end.


I see, that makes sense, and it might hint at an underlying problem: with 15 games (or 30 days) to a rating period, the rating periods will often be staggered between players (which, as far as I know, is no good for the system), so the green line (while not perfectly in line with Glicko-2) bases ratings more on current skill.

I always thought the system we had in place for rating periods was a bit wonky.

So I suppose the real question is: does it have a higher accuracy in predicting wins and losses (even if it’s just for you) than the OGS implementation?

I never tested it.

I’ll have to think about how to measure this reliably. (Do you think squared difference is a good measure? [(1 - prob)^2 for a win and (0 - prob)^2 for a loss])

I’ll also have to find some rating histories with reliable player strength (equally good at all speeds/sizes, or only playing one) that don’t include many games against bots.
My own rating history is biased by the fact that my 9x9 rank is about 13k and my 19x19 around 17k. I just dropped 5 ranks because I switched to only 19x19 for 2 weeks.
But as I think about it, the rating system should cope with that. I’ll use my own history until I find a better one.

I’ll check it. (I can only check against opponent ratings provided by OGS.)

Great point, great idea.

It’s interesting, actually, that the ostensible primary purpose of the rating system is to find even matches.

If that is the case, then ^^^ this is the primary question, and it will be fascinating if someone can figure out the answer.

But actually, I think there’s a softer purpose, possibly as important, which is communicating to us how “good” we are in some abstract sense. If that were not the case, we wouldn’t bother converting to Kyu/Dan.

It’s this second purpose that isn’t well served by a system that fluctuates wildly.

It might almost suggest that our K/D ranking should be “smoothed” relative to our Glicko rating at any given time…

Most systems I know of use the direct % of games where the winner was predicted by the rating (count 1 when the stronger player wins, 0 when the weaker player wins, and take the average), but I have seen Root Mean Square used as a measure; either one should work. (Personally I prefer the RMS system.)

Getting more varied data sets would also be reasonable, especially since we’re talking an entire server of players here.

I have some first estimates.
I used the opponent ratings as provided by OGS (no recalculation of their rating history).
I only recalculated my rating history with OGS like Glicko2 (again with opponent ratings provided by OGS).

rms is calculated over (outcome - win_probability)
direct % as 1 - Σ(outcome XOR (player_rating > opponent_rating)) / number_of_games
(outcome = 1 if the player wins, and outcome = 0 if the opponent wins)
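These two measures can be sketched as follows (the sample games and probabilities below are made up for illustration; the XOR simply means “the prediction disagreed with the outcome”):

```python
import math

def rms_error(predictions):
    """predictions: list of (win_probability, outcome), outcome being 1 or 0.
    Root mean square of (outcome - win_probability); lower is better."""
    return math.sqrt(sum((outcome - p) ** 2 for p, outcome in predictions)
                     / len(predictions))

def direct_pct(games):
    """games: list of (player_rating, opponent_rating, outcome).
    Fraction of games where the higher-rated side won; bigger is better."""
    correct = sum(1 for pr, opr, outcome in games
                  if (pr > opr) == (outcome == 1))
    return correct / len(games)

# Hypothetical sample: three games with ratings, outcomes and model probabilities.
games = [(1620, 1500, 1), (1480, 1550, 0), (1700, 1650, 0)]
preds = [(0.66, 1), (0.40, 0), (0.57, 0)]
print(rms_error(preds), direct_pct(games))
```

A perfectly calibrated but maximally uncertain predictor (always 0.5) would score an RMS of 0.5, which is why values like 0.46 sit so close together.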

Prediction quality of my rating history: (rms: lower is better, direct %: bigger is better)
Player id: 449941
Games: 978

| algorithm | rms | direct % |
| --- | --- | --- |
| OGS | 0.4645 | 64.93% |
| Glicko2 | 0.4637 | 65.95% |

There is practically no difference in predictive power.

To get the right values for Glicko2, I would have to recalculate the histories for the whole player base. At the moment I’m not able to do that (rating history gets cropped to max 5000 games).

EDIT: Forgot to adjust ratings for handicap games in the calculation of the direct %. Now both are slightly higher.


Idk about that, 1% increased accuracy seems to be a lot as far as rating systems are concerned…

But yeah, a more in-depth study would be required before making any conclusions.

Turn to page 7.

He is right about it not being significant, though.

From a chi-square goodness of fit test (with OGS predictions being the expected values), the p value is 0.642182, which is very far from significant. Needs more testing before we can reach any conclusions.

p-values are close to useless, it’s more useful to calculate the effect size. :stuck_out_tongue:

P-values are necessary for any type of statistical research unless you want to fall into the gambler’s fallacy. Calculating past effect size is not quite the same as preparing for future data which can and will differ from past data.

This is why we need more research/data before we make any conclusions on the matter, since the 1% increase is suggestive, but not significant enough to be granted as fact.

p-values offer the probability of observing a given set of data (assuming…), effect sizes allow us to compare results of several studies. I wrote a whole chapter on this, hence my bias towards effect sizes. :slight_smile:

You could have a tiny effect even if you had a spectacularly low p, simply due to having a high N (or many, in the worst case even unadjusted-for comparisons… hello fMRI data).
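One common effect-size measure for comparing two proportions is Cohen’s h (my example, not something computed in the thread); applied to the direct % numbers posted above, it comes out tiny:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine-transformed difference)."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# direct % from the earlier post: Glicko2 65.95% vs OGS 64.93%
h = cohens_h(0.6595, 0.6493)
print(h)  # roughly 0.02, far below the 0.2 conventionally called a "small" effect
```

So even if more data eventually made the difference statistically significant, the practical gap between the two implementations would still be minute by this measure.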

I’m not sure how you can get a good measure of significance from that alone, though…

I’m not sure how to get a reliable measure of value either. Both (the OGS and Glicko2 predictions) share the same opponent_ratings; therefore both values share, to some extent, the OGS reliability.

I’m not sure how much one should trust even statistical significant results.

Well, it is pretty much guaranteed that we want to continue the investigation, but the question of how we should measure the success of our experiment is of undoubted importance.