Rank Instability on OGS

I’ve noticed that your rank in particular seems considerably unstable - it’s worth taking a look at why.

To my mind, the answer is that you don’t play a uniform mix of stronger and weaker players.

During a particularly turbulent period of your rating, you were playing stronger players about 75% of the time.

The ranking system works best, both for an individual and as a group, when we play an even mix of stronger and weaker.

That’s what the pie chart is for (ain’t that so @Sarah_Lisa :smiley: :stuck_out_tongue: )

GaJ

2 Likes

How many people are going to make sure they play an even balance anyway? Especially considering that if you’re weaker, most of the players are stronger than you, so if you don’t particularly care who your opponents are, they’re mostly going to be stronger. And if you’re in upper-kyu/dan territory, it’s naturally harder to find stronger opponents, especially in tournaments.

If the system doesn’t work very well with that, maybe we need a better system.

FWIW, I personally take care.

You can’t always choose your next pairing (ladders, tourneys etc.), but when I can choose, I choose in the direction of balancing it out.

It’s the whole reason I implemented the pie chart: to make this easier.

I think it is a good idea because of:

  • the “abstract” idea that it makes the rating system work better,
  • the “idealistic” reason: I want stronger people to play with, and for that they need to play weaker people, so if I expect that of them I should do the same,
  • it helps to have some wins as well as losses :slight_smile:

Sure, some people are like “screw you, I am just going to play up; some other suckers can play down… and I don’t mind losing, it’s the challenge that I care about”. That’s OK - it’s your choice.

But if you do that, it’s a bit rich to complain about unstable rank.

I think there’s a fundamental reason why you have to play up and down to have a meaningful rating system. If people are not compared to weaker as well as stronger players, how can the system know how weak or not they actually are?

If I only play SDKs, and I lose all the time, how does the system know if I am 13k or 20k?

You have to play up and lose and play down and win for any rating system to place you.

GaJ

3 Likes

Hi,

sorry for my raging about my particular unstable rating history. It was off topic, since my main point was that the deviation on OGS seems to be too high (compared to other Glicko implementations), not only for my account, but for all the player histories I looked at.


In this sense this post goes off topic too, since it is a reply to @GreenAsJade.

I share your “idealistic” idea of playing against weaker and stronger opponents, for the same reasons. But most of my games are part of a tournament, and for some reason most of the players there are stronger than I am.

I modified my script to run on a balanced subset of my history (removing extremely strong/weak opponents where needed to get an almost 50% ratio for the rating period). The result is qualitatively the same, showing the same ups and downs.
It looks like (at least in my case) having more stronger than weaker opponents is not the problem.

Some other test runs point to another reason for the big variation:
I have a rather large difference in playing strength on 9x9 and 19x19. On 9x9 my rating is around 1400 (ca. 15k) and on 19x19 it is around 1150 (20k).
Plus I’m not sure if my performance over time is average; it sometimes feels unstable (good week vs bad week).
I fear the mix of different board sizes and varying day-to-day performance leads to my unstable rank.
I will create a new topic once I’ve done some more analysis.

If you want to discuss this in more detail, I would prefer to do it in a new topic (to not clutter this one too much with detailed discussions and speculations).


Just because you asked:

It depends on the rank of the SDK of course :wink:

A 13k against a 9k has a win probability of 24% → so not a 13k player
A 20k against a 9k has a win probability of 5% → I think Glicko can manage this, but I’m not sure
Against a 5k it’s about 8% and 1.5% → Glicko may get the 13k right, but for a 20k I think Glicko has problems getting it right.

I calculated the probabilities with E = 1 / (1 + exp(-(rank1 - rankSDK))), with both ratings converted to Glicko2’s internal scale. This is the expected-score formula Glicko2 uses; I set the deviation of the SDK to 0 (so g(φ) = 1).
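
For anyone who wants to reproduce these numbers, here’s a minimal sketch in Python (the 1250/1450 ratings are hypothetical, chosen only to give roughly the 200-point gap between a 13k and a 9k):

```python
import math

GLICKO2_SCALE = 173.7178  # standard Glicko -> Glicko2 scale constant

def mu(rating):
    """Convert a Glicko-scale rating to Glicko2's internal scale."""
    return (rating - 1500) / GLICKO2_SCALE

def expected_score(mu1, mu2):
    """Glicko2 expected score with the opponent's deviation set to 0,
    so that g(phi) = 1 (the formula used above)."""
    return 1.0 / (1.0 + math.exp(-(mu1 - mu2)))

# A ~200-point rating gap (hypothetical 13k vs 9k ratings) reproduces
# the ~24% win probability quoted above.
print(expected_score(mu(1250), mu(1450)))  # ≈ 0.24
```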

The same here. We can discuss this in a different topic or in chat.
If I did something wrong I’ll remove it.

3 Likes

Well, in a series of even games, 20k will never beat 9k…

Yeah - what may happen, though, is that in a series a 20k beats a 9k, and thus becomes 15k. Then they beat them again and become 12k. Then they beat them again and become 9k. Resulting in at least 2 “outlier” looking games.

That’s not weird, that’s just someone ranking up.

What’s weird is that the same person who ranked up can get whacked back down again…

As witness:

2 Likes

I’m not sure how weird that is…

When your RD is high it’s “easy come, easy go” (although it affects your opponents less), but the more (and more consistently) you play, the lower your RD goes (though, depending on your volatility, there might be a floor for that RD), and your rank becomes much less unstable.

I’m not so sure what the complaint is… this instability (and estimation of how unstable it should be) was designed into the glicko-2 system for the sake of accuracy.

The weirdest thing about the OGS system, however, is the exponential conversion from Elo/glicko points to rank. I imagine this has to do with the idea that handicap stones are worth much more as you get stronger, but it still seems odd to me (especially when RD is converted into ranks somehow), and it causes weaker rankings to be much more unstable (which, granted, also makes sense given the unpredictability of players at those ranks).
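
To make the “exponential conversion” concrete, here is a small sketch. The constants are the rating↔rank conversion I believe OGS uses (rank index 0 = 30k, 30 = 1d), but treat them as an assumption:

```python
import math

def rank_from_rating(rating):
    """Assumed OGS conversion: rank index 0 = 30k, 30 = 1d."""
    return math.log(rating / 850.0) / 0.032

# Because the curve is exponential, the same 100-point rating step
# spans more kyu ranks at the weak end than at the strong end:
print(rank_from_rating(1100) - rank_from_rating(1000))  # ≈ 3.0 ranks
print(rank_from_rating(2100) - rank_from_rating(2000))  # ≈ 1.5 ranks
```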

Also, there are many other factors at play: play styles, club accounts, how much attention you’re paying to each game, correspondence play, how often the player plays… and in the example you gave, 2 months is a long time to learn (and unlearn) many different things, switch up play styles, and so on (unless you only play correspondence). So I don’t know that this makes ranks “too unstable”…

1 Like

One thing I’ve noticed and would like a second opinion on is this:

Almost universally, when I win by resignation against a weaker player, my rank in that individual category (e.g. live 19x19) goes down. (My overall rank goes up, as it should.) Sure, it might have to do with the “batch of ratings” etc., but I do find it suspicious that this seemingly happens every time. Anybody else notice that?

The main reason I find it weird is actually an assumption on my part I guess.

And I guess it’s the same assumption that other people concerned about rank instability have.

It’s the idea that your rank is reflective of your current skill, and that as you play more your skill should increase. And that fundamentally your skill is not something that varies a lot quickly, especially downwards.

Secondarily is the idea that the uncertainty indicator means something. So if someone has ranked up to 2k +/- 2 then it is surprising that their rank would quickly descend back down to 10k.

If the uncertainty measure was meaningful in the intuitive way, I personally would think that OGS thinks the minimum the person could statistically be is 4k. 10k is a lot less - how did they suddenly “lose all that skill”?

Either OGS is a lot more unsure about someone’s underlying skill than it lets on, or the measure of skill itself is too unstable, or the idea that skill itself is comparatively stable and increasing is wrong.

GaJ

1 Like

It is supposed to be, but it is also a statistics-based tool, and these particular statistics don’t care how quickly your skill changes (in fact, the author of Whole History Rating models skill as following “Brownian motion”, which is essentially a random process).
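
If it helps intuition, here’s a toy version of that skill model (the per-day step size is made up; WHR estimates it from data):

```python
import random

def skill_walk(days=60, step_sd=5.0, start=1500.0):
    """Toy Brownian-motion skill model: true skill takes a small random
    step each day. There is no built-in direction, so short-term
    wiggles are expected even from a perfectly rated player."""
    skill, history = start, [start]
    for _ in range(days):
        skill += random.gauss(0, step_sd)
        history.append(skill)
    return history
```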

This is where RD comes in, and while RD does mean something, it probably doesn’t mean what you think it does.

RD, or Ratings Deviation, is a kind of standard deviation, so it really means “we’re pretty sure your real skill has about a 68% chance of falling into this interval” (unless what they are showing is 2 RDs, in which case it’s about a 95% chance). The reason it gets smaller as you play more is that the system becomes more and more sure that your rating falls into that interval.
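
Those 68%/95% figures fall straight out of the normal model; a quick check:

```python
from statistics import NormalDist

# Probability that "true" skill lies within k RDs of the displayed
# rating, if RD behaves like the standard deviation of a normal.
for k in (1, 2, 3):
    p = 2 * NormalDist().cdf(k) - 1
    print(f"within ±{k} RD: {p:.1%}")  # ≈ 68.3%, 95.4%, 99.7%
```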

The weird part to me (and it frustrates me that it even exists) is the +/- ranks conversion. First of all, I have no idea how they’re calculating that, because, like I said, the Elo/glicko conversion to ranks is logarithmic/exponential, so the + side when converted to ranks is smaller than the - side. I don’t know if they’re taking an average or what, but looking at the actual Elo/glicko score for RD is a whole lot more accurate.
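
That asymmetry is easy to see with the same assumed conversion as in the earlier sketch - converting rating ± RD to ranks gives a smaller + side than - side:

```python
import math

def rank_from_rating(rating):
    # same assumed OGS conversion as the earlier sketch
    return math.log(rating / 850.0) / 0.032

rating, rd = 1200.0, 100.0
center = rank_from_rating(rating)
plus = rank_from_rating(rating + rd) - center
minus = center - rank_from_rating(rating - rd)
print(f"+{plus:.2f} / -{minus:.2f} ranks")  # ≈ +2.51 / -2.72
```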

This is not an unreasonable thought, but never forget that the RD is not an absolute floor, but rather a reflection of a normal distribution (via standard deviation). Strictly speaking, OGS thinks the minimum score the person could have is -:infinity: (though I think the rank floor is 25k), and the maximum :infinity:, it just considers it REALLY unlikely that the truth falls more than 5 RDs away (at the time of the last game).

Also, RD is only supposed to reflect your expected skill interval at the time of your last game!
This is important, because that is the main reason ranks are unstable: the whole idea of RD and volatility is to capture “how much are you expected to change between now and your next ratings period”, and it should not be interpreted otherwise.

So, ratings are unstable by design. The main difference between Elo and glicko-1 is that glicko made the update size variable via RD (in Elo the K-factor is fixed), and the main difference between glicko-1 and glicko-2 is the inclusion of volatility, which basically puts a floor under RD based on how fast your skill has historically moved up and down.
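
For the curious, here’s a minimal single-game Glicko-1 update, following Glickman’s published formulas, showing the “variable K” in action - the step size scales with the player’s own RD:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    return 1 / math.sqrt(1 + 3 * (Q * rd) ** 2 / math.pi ** 2)

def expected(r, r_j, rd_j):
    return 1 / (1 + 10 ** (-g(rd_j) * (r - r_j) / 400))

def glicko1_update(r, rd, r_j, rd_j, score):
    """One-game Glicko-1 update. Unlike Elo's fixed K-factor, the size
    of the rating step grows with the player's own RD."""
    e = expected(r, r_j, rd_j)
    d2 = 1 / (Q ** 2 * g(rd_j) ** 2 * e * (1 - e))
    denom = 1 / rd ** 2 + 1 / d2
    r_new = r + (Q / denom) * g(rd_j) * (score - e)
    rd_new = math.sqrt(1 / denom)
    return r_new, rd_new

# A provisional player (RD 350) jumps ~175 points after one upset win;
# an established player (RD 60) moves only ~10 points for the same win.
print(glicko1_update(1500, 350, 1500, 60, 1))
print(glicko1_update(1500, 60, 1500, 60, 1))
```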

I hope that clears up a few things, but yeah, it seems to be mostly a case of misinterpreting the numbers.

I’m not actually sure what’s going on there (assuming it is happening). Glicko-2 should not allow for that beyond the “batch of ratings” thing, and if it were doing it beyond that, it would indicate a bug of some sort (but I didn’t know anybody used the individual category ratings…)

4 Likes

I appreciate that RD is a standard deviation.

But experience is showing us that it doesn’t “feel right”, because our ranks vary much more than we intuitively expect: the standard deviation for most well-established players is <2 ranks, and yet we see (in the example I gave) a person fluctuating up by 7 kyu and down again.

Similarly, I think we appreciate that skill moves in a Brownian but directed way. That’s why wiggles in our graph are totally expected, but massive fluctuations are not.

1 Like

A complaint against unstable ranks can go like this. When ratings become too unstable, they lose meaning. You can’t confidently say that you’re rank X, because your rank fluctuates like crazy. You can’t really say that you “reached” a certain rank either. And even though unstable ratings are supposed to be more accurate, it becomes kind of hard to judge your opponents: a 1d could be a 4k on a lucky streak, so the only way is to examine their history (or play the position and not the opponent, but who does that :stuck_out_tongue:). There are people who like more conservative systems: yeah, sometimes one plays like 2k, sometimes like 6k, but it would be nice if the system brushed that off and showed 4k the whole time.

Say what you want, but from my perspective the green line in flovo’s graph appears to describe what’s going on much better than the OGS line.

And let’s remember that this picture hasn’t been disproved or confirmed yet. Maybe ranks are supposed to be less shaky in Glicko2 after all.

Or maybe it’s just the nature of go ranks. As we know OGS ranks stand less than 100 points apart. That’s dense.

2 Likes

Could it be confirmation bias? Looking at my history, my rating goes up after a win by resignation against a weaker player. I would need your player id to verify.

I think you’d have to reassess your premise that the 10k>2k walk was

  • accomplished by an honest-to-god 10k
  • due to rating uncertainty

instead of being an extreme outlier that suggests something other than legitimate play is at work (my portmanteau in this case would be… sandbotting). To rephrase my words from somewhere above: instead of doubting Glicko2’s feature (high adaptability), why not doubt the result (“omg, a 10k can be 2k, glicko is useless” → “oh look, this must be a very unique individual”).

Actually my words were something to the tune of ‘It may be annoying to get the impression that you’re making progress when really it’s just Glicko being really enthusiastic about your recent winning streak and showing results immediately - only to get frustrated again when the law of averages kicks in and you’re regressing to your mean performance, but it’s better to be able to see results quickly if you did make progress than having to wait 20 games just because the system is built to err on the safe side (hello IGS)’.

To reiterate, if my 12 years of experience with all sorts of Go players means anything, someone who plays evenly with other 10k does not overnight nor over the course of two months go head-to-head with 2k. Shenanigans.

From my own experience and the games I’ve watched here it is very well possible that someone deviates maybe 2 ranks, yes. But in a game between a drunk 3d on a losing streak and a sober 2k on a winning streak, if they meet at 1d, the odds are not 50/50. That’s why the 3d-1d will bounce back and the 2k-1d will drop again.

3 Likes

FWIW, thanks to @flovo’s script, I’ve been able to see that the OGS algorithm seems to be really self-adaptive against abnormal data.
We did a check based just on timeout wins (where, for instance, I undeservedly won against some dan player), and it heartened me about the overall system.
As long as the outcome series doesn’t contain frequent and recurrent anomalies, the rating should be able to correct itself fairly quickly.

Maybe I’m incredibly dense, but what did flovo do differently between the red line and the green line? Does it have higher or lower predictive power (since that’s the true point of ratings/rankings)?

This is actually a really serious issue with rating systems that people do attempt to address (look at Whole History Rating), and it is part of what RD is supposed to address (the “this is where we think you are”, not “this is how far we think you can move”).

Sure, and that makes sense, but keep in mind that most rating systems are, in a way, statistical theories (which is why they always come in academic papers), and thus need some sort of data to back up their claims.

It could be possible that the OGS implementation is not that great (I’d like more details on flovo’s implementation), but your problem lies either with the implementation or with the theory, and the battle over theory is not an easy one to fight…

1 Like

In short: green is Glicko2, red is OGS.

In more detail:
The green line is an OGS-like Glicko2 calculation. By “OGS-like” I mean that I use the current rating of the opponents, not their rating at the start of the rating period (their base rating).
For the red line, I had to artificially raise the deviation of the player’s base rating at the start of each rating period. I use the deviation provided by OGS’s termination API for the first game of the new rating period, because I don’t know how OGS calculates it (as far as I can tell, it’s an undocumented feature of the OGS rating system).

For both lines I use:

  • rating period = 15 games or 30 days (30 days never applies in this special case; see the sketch after this list)
  • initial player rating = 1500, deviation = 350, volatility = 0.06 (that’s what OGS uses)
  • τ = 0.6 (I could have chosen τ = 0.3 or τ = 1.2 as well; they cause no visible difference in my tests)
  • For the current rating of the opponents, I pull their rating history and look up their rating 1 second before the “current” rating calculation.
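
To illustrate the rating-period setup, here’s a sketch of how games could be grouped under those rules (the `ended` timestamp attribute is hypothetical, just for illustration):

```python
from datetime import timedelta

def split_into_rating_periods(games, max_games=15, max_days=30):
    """Group a chronological list of games into rating periods that
    close after 15 games or 30 days, whichever comes first."""
    periods, current = [], []
    for game in games:
        if current and (len(current) >= max_games or
                        game.ended - current[0].ended > timedelta(days=max_days)):
            periods.append(current)
            current = []
        current.append(game)
    if current:
        periods.append(current)
    return periods
```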

I want to note that after approx. 45 games, the deviation doesn’t get any lower. It stays between ≈90 at the start of a rating period and ≈65 at the end of one. This is common to almost all players on OGS who play multiple games per month, as you can easily verify here: OGS rank histogram (outdated) - #28 by DVbS78rkR7NVe

The Glicko2 deviation (green line), by contrast, drops below 34 at the end.

1 Like

I see - that makes sense, and it might hint at an underlying problem. With 15 games (or 30 days) to a rating period, different players’ rating periods will often be staggered (which is, as far as I know, not good for the system), so the green line (while not perfectly in line with glicko-2) bases ratings on more current skill.

I always thought the system we had in place for rating periods was a bit wonky.

So I suppose the real question is: does it have a higher accuracy in predicting wins and losses (even if it’s just for you) than the OGS implementation?

I never tested it.

I’ll have to think about how to measure this reliably. (Do you think squared difference is a good measure? [(1 - prob)^2 for a win and (0 - prob)^2 for a loss])
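
That squared-difference measure is the Brier score, a standard way to score probabilistic predictions. A minimal sketch:

```python
def brier_score(predictions):
    """Mean squared error between predicted win probability and the
    actual outcome (1 = win, 0 = loss). Lower is better; always
    predicting 50% scores 0.25."""
    return sum((result - p) ** 2 for p, result in predictions) / len(predictions)

# e.g. compare the two lines over the same games (hypothetical names):
# brier_score(ogs_predictions) vs brier_score(glicko2_predictions)
```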

I’ll also have to find some rating histories with reliable player strength (equally good at all speeds/sizes, or only playing one) and without many games against bots.
My own rating history is biased by the fact that my 9x9 rank is about 13k and my 19x19 around 17k. I just dropped 5 ranks because I switched to only 19x19 for 2 weeks.
But as I think about it, the rating system should cope with that. Will use my own history until I find a better one.

Will check it. (I can only check against opponent ratings provided by OGS)

Great point, great idea.

It’s interesting, actually, that the ostensible primary purpose of the rating system is to find even matches.

If that is the case, then ^^^ this is the primary question, and it will be fascinating if someone can figure out the answer.

But actually, I think there’s a softer purpose, possibly as important, which is communicating to us how “good” we are in some abstract sense. If that were not the case, we wouldn’t bother converting to Kyu/Dan.

It’s this second purpose that isn’t well served by a system that fluctuates wildly.

It might almost suggest that our K/D ranking should be “smoothed” relative to our Glicko rating at any given time…