Rank Instability on OGS

I wrote and meant exactly “how much your rank changed” and not “how much your rank would change”. After the game is finished it’s as simple as comparing two numbers. But it would allow people who play tons of games to see all the weird things Glicko does, and maybe spot some common irregularities.

OGS is not very transparent. Try finding out what counts as live and what counts as blitz. Not trivial at all.

1 Like

Actually even that is still dependent on all those factors, so it does not really give any insight into whether the system works as it is supposed to.

I don’t think corr, live, and blitz are weighted differently, are they? When I talked about blitz I was referring to how it’s both easier to surmount a strength difference and quicker to finish a large number of games, resulting in a significant compound effect.

1 Like

That’s all I can do, too… guessing… but it isn’t very scientific, is it?

I really appreciate everybody’s effort to guess and justify, but my logical mind would prefer solid data.
As an example, when are batches of games recalculated? The first of the month? When it’s full moon? On a random basis? This could help to check changes.

@GreenAsJade, could you help? You are able to modify the site code, so you should be able to identify the function (I used to call them “subroutines” when I was young) that does this job. Is it possible? Could we look at the code? Perhaps the answers to my doubts are in plain sight and I just don’t know where to look…

1 Like

Well, the thing is that only the “client” side is open source. So I can read and contribute to the way the site looks and to the interface (buttons, where they are, etc.), but not to the underlying “fundamental way it works” (aka the server-side code).

How rating works is part of that “fundamental way it works”.

GaJ

3 Likes

Hi @AdamR

There could be an issue with the way OGS calculates the deviation of players’ ratings. This results in high fluctuations of players’ ranks.

I found this because I wanted to understand Glicko-2 in depth and to get a better insight into how OGS implements it.

In the plot below, you can see my rating as shown by OGS in the rating history (upper black line)
and the ratings as calculated by me (upper green line).
The lines at the bottom of the graph are their deviations.

The red lines are a calculation with my implementation, but forcing the deviation to be the same as the OGS values.

As you can see, my green rating line is much more stable than the OGS rating, while the red line (with the forced deviation) closely follows OGS. The deviation of the OGS data is also much higher (about 2 times; I didn’t calculate the exact number).

I hope this helps to pin down the possible issue with the unstable ranks. Sorry for the necro.


Implementation details:

I implemented Glicko-2 as described in the paper http://www.glicko.net/glicko/glicko2.pdf. I tested my implementation against this one, https://github.com/sublee/glicko2, and found no errors.
All the OGS data I used (player ratings, opponent ratings, game results) comes from https://online-go.com/termination-api/player/449941/rating-history.

If you need more details, just ask. I can also send you my Python scripts.
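
For illustration, here is a minimal sketch of this kind of cross-check (not my actual script), run on the worked example from the Glicko-2 paper. It assumes sublee’s glicko2.py is importable as glicko2; my_glicko2_rate is a hypothetical placeholder for one’s own implementation.

# Cross-check sketch: compare a hand-rolled Glicko-2 against sublee's
# implementation (https://github.com/sublee/glicko2) on one rating period.
from glicko2 import Glicko2, WIN, LOSS   # assumes sublee's glicko2.py is on the path

env = Glicko2(tau=0.5)

# Player going into the rating period: rating, deviation, volatility.
player = env.create_rating(1500, 200, 0.06)

# One rating period, as (score, opponent) pairs -- the example from the paper.
period = [
    (WIN,  env.create_rating(1400, 30)),
    (LOSS, env.create_rating(1550, 100)),
    (LOSS, env.create_rating(1700, 300)),
]

reference = env.rate(player, period)
print(reference.mu, reference.phi, reference.sigma)   # roughly 1464 / 151.5 / 0.06

# mine = my_glicko2_rate(1500, 200, 0.06, period)     # hypothetical: your own code
# assert abs(mine - reference.mu) < 1e-6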

13 Likes

^ Fascinating! Thank you for your efforts! :slight_smile:

1 Like

Can you illustrate this in (pseudo-)code for the two formulae? Because “but forcing the deviation to be the same as OGS ones.” does not explain a great deal.

Sorry for that.

I’m iterating over the game history from earliest to latest, finalizing a rating period every 15 games. i is the index of this iteration.
OGS_deviation is the deviation as listed in https://online-go.com/termination-api/player/449941/rating-history
The sliced arrays ([a:b]) include the elements from index a up to and including index b.

For the correct calculation (only line 4 differs)

begin_rating_period = i - i mod 15
// obtain the last finalized values
last_finalized_rating = my_calculated_ratings[begin_rating_period - 1]
last_finalized_deviation = my_calculated_deviation[begin_rating_period - 1]
last_finalized_sigma = my_calculated_sigma[begin_rating_period - 1]
// calculate the new rating
my_calculated_ratings[i], my_calculated_deviation[i], my_calculated_sigma[i] = 
     Glicko.rate(last_finalized_..., OGS_opponents_rating[begin_rating_period:i], OGS_opponents_deviation[begin_rating_period:i])

For the “forced” deviation (only line 4 differs)

begin_rating_period = i - i mod 15
// obtain the last finalized values
last_finalized_rating = my_calculated_ratings[begin_rating_period - 1]
last_finalized_deviation = OGS_deviation[begin_rating_period]
last_finalized_sigma = my_calculated_sigma[begin_rating_period - 1]
// calculate the new rating
my_calculated_ratings[i], my_calculated_deviation[i], my_calculated_sigma[i] = 
     Glicko.rate(last_finalized_..., OGS_opponents_rating[begin_rating_period:i], OGS_opponents_deviation[begin_rating_period:i])

The plot shows the calculated values after each step.
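
For anyone who wants to reproduce this, here is a rough, runnable Python transcription of the pseudocode above, under a few assumptions of mine: it uses sublee’s glicko2.py as the rating engine, and the input lists (opponent ratings, opponent deviations, results, OGS deviations) are placeholders for whatever you parse out of the rating-history endpoint. It is a reconstruction, not the original script.

from glicko2 import Glicko2

GAMES_PER_PERIOD = 15
env = Glicko2()

def recompute(opp_ratings, opp_deviations, results, ogs_deviations=None):
    # results[i] is 1.0 for a win and 0.0 for a loss.
    # If ogs_deviations is given, the last finalized deviation is forced to the
    # OGS value (the "red line" variant); otherwise the self-consistent value
    # is used (the "green line" variant). Only that one line differs.
    ratings, deviations, sigmas = [], [], []
    current = env.create_rating()        # Glicko-2 defaults: 1500 / 350 / 0.06
    for i in range(len(results)):
        begin = i - i % GAMES_PER_PERIOD
        if begin > 0:
            # Values finalized at the end of the previous rating period.
            mu = ratings[begin - 1]
            phi = deviations[begin - 1] if ogs_deviations is None else ogs_deviations[begin]
            sigma = sigmas[begin - 1]
            current = env.create_rating(mu, phi, sigma)
        # Re-rate the (partial) period: games begin .. i inclusive.
        series = [(results[j], env.create_rating(opp_ratings[j], opp_deviations[j]))
                  for j in range(begin, i + 1)]
        new = env.rate(current, series)
        ratings.append(new.mu)
        deviations.append(new.phi)
        sigmas.append(new.sigma)
    return ratings, deviations, sigmas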

3 Likes

You beat me to it, I was planning to do exactly the same after I’m back from holiday in a few days.

1 Like

For super dumb people: how can one recalculate ratings and deviations at all if you need to know the ratings of the last several (~15, I think?) opponents at the time the game ends? rating-history seems to give only the numbers of the last opponent.

1 Like

That’s a rather good question.

I used the opponent ratings of the previous games (rows below). That was the easiest way and the results are good enough to start with.

Following your suggestion, I modified my script to use the “current” opponent ratings by looking at the rating history of each opponent.
The new plot is indistinguishable from the old one. Therefore I’ll leave it as it is.


It is even worse: the listed rating is the rating after the game ended, i.e. it is calculated including the result of the listed game.
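
A small sketch of the lookup described above, assuming you have already parsed each opponent’s rating history into (ended_timestamp, rating, deviation) rows; that field layout is my assumption, not the actual format of the endpoint. The key point from the quote above is that each listed rating already includes the game’s result, so the opponent’s rating going into a game is the entry just before it in their own history.

import bisect

def rating_before(opponent_history, game_ended):
    # opponent_history: list of (ended_timestamp, rating, deviation),
    # sorted by timestamp (assumed layout). Returns the rating/deviation
    # from the last game that finished strictly before `game_ended`,
    # or the Glicko-2 defaults if there is none.
    times = [t for t, _, _ in opponent_history]
    idx = bisect.bisect_left(times, game_ended)
    if idx == 0:
        return 1500.0, 350.0          # no earlier games: default rating
    _, rating, deviation = opponent_history[idx - 1]
    return rating, deviation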

2 Likes

Hello :slight_smile:

very interesting and my deepest thanks for all the work put into this.
Unfortunately this is way above my understanding of the system, so I will be unable to discuss it, let alone change anything :smiley: but I will relay this to our devs.

They will probably be unhappy about trying to tinker with the system as any deep changes are surely dangerous and (as far as we are concerned) the system works “well enough”, but if there really is some underlying issue, I am sure they will eventually try to fix it, and maybe contact you for more details :slight_smile:
Thanks again.

5 Likes

Thanks for your reply.

I can understand them. Changes to the rating code would probably change the rating distribution, but…

“well enough” is the right choice of words, I guess :wink: I wouldn’t use “well”, but “well enough” is quite fitting.
It’s on the borderline of being ill-behaved, but on the side of well-behaved.
My rank is “stable” between 12k and 18k. On the ladders I have to check whether the person I want to challenge is at the lower or the upper end of their own 5-rank interval.

Therefore I would appreciate it if the ranks were more stable (well-behaved, so to speak), but I can understand if the devs decide that changing/fixing this would not be worth the time.
I can work around the instability/fluidity of the ranks. It just gets annoying sometimes.

Thank you for your good work.

2 Likes

I’ve noticed that your rank in particular seems considerably unstable - it’s worth taking a look at why.

To my mind, the answer is that you don’t play a uniform mix of stronger and weaker players.

During a significantly turbulent period of your rating, you were up at 75% of games against stronger players.

The ranking system works best, both for an individual and as a group, when we play an even mix of stronger and weaker.

That’s what the pie chart is for (ain’t that so @Sarah_Lisa :smiley: :stuck_out_tongue: )

GaJ

2 Likes

How many people are going to make sure they play an even balance anyway? Especially considering that if you’re weaker, most players are stronger than you, so if you don’t particularly care about your opponents, they’re mostly going to be stronger. And if you’re in upper-kyu/dan territory it’s naturally harder to find stronger opponents, especially in tournaments.

If the system doesn’t work very well with that, maybe we need a better system.

FWIW, I personally take care.

You can’t always choose your next pairing (ladders, tournies etc.) but when I can choose, I choose in the direction of balancing it out.

It’s the whole reason I implemented the pie chart: to make this easier.

I think it is a good idea for a few reasons:

  • the “abstract” idea that it makes the rating system work better,
  • the “idealistic” reason that I want stronger people to play with, so they need to play weaker people, and if I expect that from them then I should do the same,
  • it helps to have some wins as well as losses :slight_smile:

Sure, some people are like “screw you, I am just going to play up so some other suckers can play down… and I don’t mind losing, it’s the challenge that I care about”. That’s OK - it’s your choice.

But if you do that, it’s a bit rich to complain about unstable rank.

I think there’s a fundamental reason why you have to play up and down to have a meaningful rating system. If people are not compared to weaker as well as stronger players, how can the system know how weak or not they actually are?

If I only play SDKs, and I lose all the time, how does the system know if I am 13k or 20k?

You have to play up and lose and play down and win for any rating system to place you.

GaJ

3 Likes

Hi,

sorry for my raging about my particularly unstable rating history. It was off topic, since my main point was that the deviation on OGS seems to be too high (compared to other Glicko implementations), not only for my account, but for all the player histories I looked at.


In this sense this post goes off topic too, since it is a reply to @GreenAsJade.

I share your “idealistic” idea of playing against weaker and stronger opponents for the same reason. But most of my games are part of a tournament, and for some reason the players there are mostly stronger than I am.

I modified my script to run on a balanced subset of my history (removing extremely strong/weak opponents where needed to get an almost 50% ratio for each rating period). The result is qualitatively the same, showing the same up and down.
It looks like (at least in my case) there is no problem with having more stronger than weaker opponents.
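
For completeness, here is a minimal sketch of one possible balancing step of the kind described above; my actual criterion may differ, so treat this only as an illustration. For each rating period it trims the most extreme opponents from the over-represented side until the stronger/weaker counts match.

def balance_period(games, my_rating):
    # games: list of (opponent_rating, result) tuples for one rating period.
    # Returns a subset with as close to a 50/50 stronger/weaker split as possible,
    # dropping the most extreme opponents from the larger group first.
    stronger = sorted((g for g in games if g[0] > my_rating),
                      key=lambda g: g[0], reverse=True)   # most extreme first
    weaker = sorted((g for g in games if g[0] <= my_rating),
                    key=lambda g: g[0])                   # most extreme first
    keep = min(len(stronger), len(weaker))
    return stronger[len(stronger) - keep:] + weaker[len(weaker) - keep:]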

Some other test runs point to another reason for the big variation:
I’ve a rather large difference in playing strength on 9x9 and 19x19. On 9x9 my rating is around 1400 (ca. 15k) and on 19x19 it is around 1150 (20k).
Plus I’m not sure if my performance over time is steady; it sometimes feels unstable (good week vs bad week).
I fear the mix of different board sizes and varying day-to-day performance leads to my unstable rank.
I will create a new topic once I’ve done some more analysis.

If you want to discuss this in more detail, I would prefer to do it in a new topic (so as not to clutter this one too much with detailed discussions and speculations).


Just because you asked:

It depends on the rank of the SDK of course :wink:

A 13k against a 9k has a win probability of 24% → no problem for a 13k player.
A 20k against a 9k has a win probability of 5% → I think Glicko can manage this, but not sure
Against a 5k it’s about 8% and 1.5% → Glicko may get the 13k right, but for a 20k I think Glicko has problems getting it right.

I calculated the probabilities with E = 1 / (1 + exp(-(μ1 - μSDK))), where μ is the rating converted to the Glicko-2 scale. This is the expected-score formula Glicko-2 uses; I set the deviation of the SDK to 0, so g(φ) = 1.
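
As a worked example, the figures above can be reproduced roughly like this. The rank-to-rating conversion (rating = 525 · exp(rank / 23.15), counting ranks up from 30k) is my assumption about the OGS mapping; the Glicko-2 part (scale factor 173.7178, g(φ) = 1 for a deviation of 0) is from the paper.

import math

def rating(kyu):
    # Assumed OGS mapping from kyu rank to rating (30k = 0 on the internal scale).
    return 525 * math.exp((30 - kyu) / 23.15)

def win_probability(kyu_a, kyu_b):
    # Expected score of a kyu_a player against a kyu_b player,
    # with the opponent's deviation taken as 0 (so g(phi) = 1).
    mu_diff = (rating(kyu_a) - rating(kyu_b)) / 173.7178
    return 1 / (1 + math.exp(-mu_diff))

for a, b in [(13, 9), (20, 9), (13, 5), (20, 5)]:
    print(f"{a}k vs {b}k: {win_probability(a, b):.1%}")
# roughly 23%, 6%, 7%, 1.4% with this conversion -- close to the figures above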

The same here. We can discuss this in a different topic or in chat.
If I did something wrong I’ll remove it.

3 Likes

Well, in a series of even games, 20k will never beat 9k…

Yeah - what may happen, though, is that in a series a 20k beats a 9k and thus becomes 15k. Then they beat them again and become 12k. Then they beat them again and become 9k. That results in at least two “outlier”-looking games.

That’s not weird, that’s just someone ranking up.
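
A small sketch of that trajectory, again using sublee’s glicko2.py; the starting ratings and deviations are illustrative guesses, and each win is treated as its own rating period.

from glicko2 import Glicko2

env = Glicko2()
underdog = env.create_rating(800, 150, 0.06)    # roughly 20k territory
opponent = env.create_rating(1300, 60, 0.06)    # roughly 9k territory

for game in range(1, 6):
    # 1.0 = win; rate the underdog on a one-game rating period.
    underdog = env.rate(underdog, [(1.0, opponent)])
    print(f"after win {game}: rating {underdog.mu:.0f}, deviation {underdog.phi:.0f}")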

What’s weird is that the same person who ranked up can get whacked back down again…

As witness:

2 Likes