Yet another ratings thread

Here is a good thread about the same question:

Specifically, in this post Anoek confirms that the log scale is empirically fitted:

This thread explains the current system:

Specifically, here is Anoek’s reasoning for keeping the log scale:


Thanks. Good collection of relevant links there.

I still think this mismatch between the rank scale and the rating scale could be what makes the OGS rating system feel uncomfortable, however.

Think about a player sitting exactly at a rank boundary: in the Glicko2 sense they are closer to ranking down than to ranking up. I would say they have more downside risk than upside risk, because players one rank below are closer to them in rating than players one rank above.
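To put rough numbers on that asymmetry, here is a minimal Python sketch. It assumes the rank-to-rating mapping `rating = 525 * exp(index / 23.15)` with 30k as index 0, which is consistent with the rating table posted further down this thread but is an assumption here, not something quoted from OGS. The function names are made up for illustration.

```python
import math

# Assumed mapping constants (inferred, not taken from OGS source code).
A = 525.0
C = 23.15


def rank_index_to_rating(index: float) -> float:
    """Rating for a rank index, where 30k = 0, 1k = 29, 1d = 30."""
    return A * math.exp(index / C)


def gaps_around(index: float) -> tuple[float, float]:
    """Rating distance to one rank below and to one rank above."""
    here = rank_index_to_rating(index)
    down = here - rank_index_to_rating(index - 1)
    up = rank_index_to_rating(index + 1) - here
    return down, up


down, up = gaps_around(20)  # index 20 corresponds to 10k
print(f"10k: {down:.1f} points down to 11k, {up:.1f} points up to 9k")
```

Under that assumption a 10k sits roughly 53 points above 11k and roughly 55 points below 9k, so the asymmetry exists but is small for a single rank step.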

I think there was some discussion about that previously as well. I believe @gennan had some proposals for alternative ways of displaying ranks.

Just in case any of them appeal to you over the other or the current way.

In a vacuum that could be true.

But the OGS ranks are calibrated to EGF and AGA ranks. That means that however those work, they in effect fit the log scale. Moving away from the log scale will not fix anything, but it will definitely lose that pairing to the EGF and AGA.

But perhaps it’s more illustrative to show how moving to a linear scale would work. Let’s, for example, pin the baseline of the current ranking (30k) and an accurate top-level rank (3 dan; everything higher is probably inaccurate because there are so few players). Then create linear, equal-size “bins” for the ranks. That would mean that:

  • People who are now 3d stay 3d (anchored)
  • People who are now 10k become 15k
  • People who are now 15k become 20k
  • People who are now 30k stay 30k (anchored)

Or in full table form:

| Current rank | Current rank index | Current OGS rating | New rank index (from fixed rating) | New rank | Shift in ranks |
|---|---|---|---|---|---|
| 30k | 0.00 | 525.0 | 0.00 | 30.0k | 0.00 |
| 29k | 1.00 | 548.2 | 0.47 | 29.5k | -0.53 |
| 28k | 2.00 | 572.4 | 0.97 | 29.0k | -1.03 |
| 27k | 3.00 | 597.6 | 1.48 | 28.5k | -1.52 |
| 26k | 4.00 | 624.0 | 2.02 | 28.0k | -1.98 |
| 25k | 5.00 | 651.6 | 2.59 | 27.4k | -2.41 |
| 24k | 6.00 | 680.3 | 3.17 | 26.8k | -2.83 |
| 23k | 7.00 | 710.4 | 3.79 | 26.2k | -3.21 |
| 22k | 8.00 | 741.7 | 4.43 | 25.6k | -3.57 |
| 21k | 9.00 | 774.5 | 5.10 | 24.9k | -3.90 |
| 20k | 10.00 | 808.6 | 5.79 | 24.2k | -4.21 |
| 19k | 11.00 | 844.3 | 6.52 | 23.5k | -4.48 |
| 18k | 12.00 | 881.6 | 7.28 | 22.7k | -4.72 |
| 17k | 13.00 | 920.5 | 8.08 | 21.9k | -4.92 |
| 16k | 14.00 | 961.2 | 8.91 | 21.1k | -5.09 |
| 15k | 15.00 | 1003.6 | 9.78 | 20.2k | -5.22 |
| 14k | 16.00 | 1047.9 | 10.68 | 19.3k | -5.32 |
| 13k | 17.00 | 1094.2 | 11.63 | 18.4k | -5.37 |
| 12k | 18.00 | 1142.5 | 12.61 | 17.4k | -5.39 |
| 11k | 19.00 | 1192.9 | 13.64 | 16.4k | -5.36 |
| 10k | 20.00 | 1245.5 | 14.72 | 15.3k | -5.28 |
| 9k | 21.00 | 1300.5 | 15.84 | 14.2k | -5.16 |
| 8k | 22.00 | 1357.9 | 17.01 | 13.0k | -4.99 |
| 7k | 23.00 | 1417.9 | 18.24 | 11.8k | -4.76 |
| 6k | 24.00 | 1480.5 | 19.52 | 10.5k | -4.48 |
| 5k | 25.00 | 1545.8 | 20.85 | 9.1k | -4.15 |
| 4k | 26.00 | 1614.1 | 22.25 | 7.8k | -3.75 |
| 3k | 27.00 | 1685.3 | 23.70 | 6.3k | -3.30 |
| 2k | 28.00 | 1759.7 | 25.22 | 4.8k | -2.78 |
| 1k | 29.00 | 1837.4 | 26.81 | 3.2k | -2.19 |
| 1d | 30.00 | 1918.5 | 28.46 | 1.5k | -1.54 |
| 2d | 31.00 | 2003.2 | 30.19 | 1.2d | -0.81 |
| 3d | 32.00 | 2091.6 | 32.00 | 3.0d | 0.00 |
| 4d | 33.00 | 2183.9 | 33.89 | 4.9d | 0.89 |
| 5d | 34.00 | 2280.3 | 35.86 | 6.9d | 1.86 |
| 6d | 35.00 | 2381.0 | 37.91 | 8.9d | 2.91 |
| 7d | 36.00 | 2486.1 | 40.06 | 11.1d | 4.06 |
| 8d | 37.00 | 2595.9 | 42.30 | 13.3d | 5.30 |
| 9d | 38.00 | 2710.4 | 44.64 | 15.6d | 6.64 |
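For anyone who wants to try other anchor points, the table can be regenerated with a short Python sketch. It assumes the current mapping is `rating = 525 * exp(index / 23.15)` (which matches the "Current OGS rating" column, but is inferred rather than quoted from OGS) and then interpolates linearly between two anchor ranks. The function names and the label convention (index 29 = 1k, index 30 = 1d) are illustrative choices.

```python
import math

A, C = 525.0, 23.15  # assumed constants of the current log mapping


def rating_of(index: float) -> float:
    """Current rating for a rank index (30k = 0, 1d = 30)."""
    return A * math.exp(index / C)


def linear_index(rating: float, low_anchor: float = 0, high_anchor: float = 32) -> float:
    """Rank index on a linear scale that agrees with the log scale at both anchors."""
    lo, hi = rating_of(low_anchor), rating_of(high_anchor)
    return low_anchor + (rating - lo) * (high_anchor - low_anchor) / (hi - lo)


def label(index: float) -> str:
    """Format a fractional rank index the way the table does."""
    if index < 30:
        return f"{30 - index:.1f}k"
    return f"{index - 29:.1f}d"


for current in (0, 10, 15, 20, 32, 38):
    new = linear_index(rating_of(current))
    print(label(current), "->", label(new), f"(shift {new - current:+.2f})")
```

Calling `linear_index(rating, low_anchor=10, high_anchor=29)` would anchor at 20k and 1k instead of 30k and 3d.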

As a result, many ranks go down, unrealistically so. If you pick different anchor points (20k and 1k, for example), the middle kyu ranks stay closer to where they are now, but they would still end up a couple of ranks lower. Current 9 dans would deviate even more and become something like 40 dans.

So moving from a log to a linear system wouldn’t fix anything. Since there is already a trend of “rank deflation”, it would only make that worse.

I don’t know exactly what you perceive as “uncomfortable” about the current rating system, but the ranks being translated on a log scale isn’t fundamentally wrong.


The issue (if there really is one) is that there is a log mapping between the two ratings, not where the specific boundaries are between ranks.

I’m not sure it’s necessarily deflation, but I think the log scale between the systems introduces greater variance on the downside of a player’s rank. Maybe this just makes it easier to find a player who is currently down on their luck, so it looks as though a lot of players have lost rank over time.

The reason the log scale might introduce downside variance is that matches are frequently played within a few ranks of a player’s current rank. The players a few ranks below are closer in Glicko2 rating than the players a few ranks above. Glicko2 is the best estimate of the expected win rates between players on this server; a player’s rank number is only derived from it. So, thinking about the average player on the server, we can imagine someone playing as many matches one rank up as one rank down over time. In that case the likelihood of those one rank down overperforming is slightly higher than the likelihood of those one rank up underperforming. Another way to describe the same thing is to consider a player who wins 50% of their games at their rank. If their skill improves so that they win 70% at that rank, maybe we call them a rank stronger, but as their rank moves up further they need to increase this margin more and more for it to count as a rank of improvement.

Across the whole server this could look like deflation or lots of players playing below their rank, which is also frequently observed of the rating system.

I’m sorry, I’m not sure I follow.

Matchmaking is purely based on Glicko ratings, not on kyu/dan ranks. So matchmaking between people is exactly as fair as matchmaking can be; kyu/dan ranks and logs don’t factor into this.

Win chance indication isn’t a goal of Go ranks. There is no rule (or expectation?) that says that 6 kyu players must win, for example, 40% of their games from a 5 kyu player, and that 15 kyu players must also win an equal 40% of their games against 14 kyu players. It seems that this “fairness” is what you’re advocating for? Only glicko ratings are an indication of this, and if you want to see those, you can.

Handicaps are the only place where the “fairness” of the log scale can become an actual issue, in both matchmaking and win indication, because handicaps are calculated from Go ranks rather than Glicko ratings, and as such they go through the log calculation. But as the other threads and my calculations above show, real human Go ranks follow a non-linear translation of Glicko rating to kyu/dan ranks. So here a log scale is more fair, or more indicative of equal win chance, than a linear scale.


I think it depends on what you mean in matchmaking terms. If I create a challenge with a rank range, then the ranks will definitely be part of matchmaking, and since they are on a log scale, the higher-rank side of that range, measured from its midpoint, will be marginally wider in Glicko2 terms than the lower-rank side.

My point about the win percentages can be illustrated similarly. Starting at any round rank level, translate it to Glicko2 (via the log scale), then draw the Gaussian-shaped probability weight function that Glicko2 gives for the player’s actual rank, and also mark the rank boundaries above and below (via the log scale). Glicko2 then thinks it is more likely that the player is one rank below than one rank above where they are ranked (the weight is higher at that rank point). That is what I mean, by another description, by players playing below their rank. In other words, if you switch a player’s ranking to kyu/dan mode, the ± value on it is a bit wrong (the + side is slightly smaller than the - side); if you switch it to Glicko2 mode, it’s right.

In my observation, it is more likely that in most kyu/dan ranking systems there is not a one-stone difference between ranks across the whole rating scale. Small mistakes are decisive much more often at higher levels of play, while the point value of the handicap stays fixed, so this should be obvious. In fact, if your rankings are not decided in handicap games, then you need to go by win rates to determine rank (e.g. rating) changes. The other thing about handicap is that it shifts the gain/loss point for ratings between the players involved (a player who loses by 2.5 points with White at 1 handicap effectively won that game; they just didn’t quite perform up to the difference in rank that induced the handicap). This seems to fundamentally align better with a Fox-type ranking system (I’ve seen it described as a league system), where you can be playing at an advantage against slightly higher ranks, then rank up, and it becomes a question of whether you can now play even with those ranks. Even on Fox, however, win rates in even games are obviously more influential than handicap games, so there probably isn’t a one-stone difference between ranks on Fox either.

Overall, I think it’s obviously fine for people to play handicap and rated handicap games; it’s just that the expectation of a one-stone difference between ranks seems unrealistic and may not align with how players actually experience being matched with players a rank up or a rank down (even in systems that describe players as playing at kyu/dan ranks).
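The ± asymmetry can be sketched in a few lines of Python. This assumes the rating-to-rank mapping is the inverse of `rating = 525 * exp(index / 23.15)` (inferred from the rating table earlier in the thread, not quoted from OGS), and the rating of 1245.5 with a deviation of 65 is a made-up example, not a real player.

```python
import math

A, C = 525.0, 23.15  # assumed mapping constants


def rating_to_rank_index(rating: float) -> float:
    """Inverse of the assumed log mapping: rank index for a rating (30k = 0)."""
    return C * math.log(rating / A)


def rank_error_bars(rating: float, deviation: float) -> tuple[float, float]:
    """Translate a symmetric rating +/- deviation into rank units on each side."""
    centre = rating_to_rank_index(rating)
    plus = rating_to_rank_index(rating + deviation) - centre
    minus = centre - rating_to_rank_index(rating - deviation)
    return plus, minus


# Hypothetical player: about 10k with a rating deviation of 65 points.
plus, minus = rank_error_bars(1245.5, 65.0)
print(f"+{plus:.2f} / -{minus:.2f} ranks")
```

With these made-up inputs the same ±65 rating points come out as roughly +1.18 and -1.24 ranks, so a symmetric ± shown in kyu/dan mode would slightly overstate the upside and understate the downside.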

I think if there were no handicap games and no alignment to EGF/AGA, ranks would probably just be a cosmetic thing, they could be equally spaced (every 100 rating points) or whatever you want.

Because one would want to match chance of winning with n stone handicap with certain differences in rating, you get certain intervals carved out like the logarithmic mapping and we’ve ended up with unequal spacing. Then there might be some shifting and tweaking to match EGF/AGA around 1dan level.

Probably that part is fine in theory.

One possible way deflation (or inflation) might come about is from how you update the players’ ratings after a handicap game.

Let’s say B is the weaker player and W is the stronger player. The proper handicap is the rank difference n = (rank W - rank B), rounded, with the kyu/dan ranks transformed into integers, etc.

The fact that the game is even(ish) is supposed to mean that B with the handicap is roughly n ranks stronger from W’s perspective, or that W while giving the handicap is n ranks weaker from B’s perspective (it’s a bit funny because the 1-stone handicap isn’t really a full “stone”; I think of going from 6.5 to 0.5 komi as only half of passing).

You can try to adjust the opponents rating to include the handicap, so if B wins, they didn’t beat W, but W reduced by a certain amount of rating. Same with W, they didn’t beat B, they beat B plus a certain amount of rating, to kind of adjust for the skill needed to win while giving certain handicap.

I think the logic I’ve seen is to say take B’s rating, turn it into a rank, add on the handicap to make an effective rank, and then convert back to rating. Similar with W but you subtract.

I think a lot depends on the details on how that’s done, if using integer ranks, then maybe there’s some rounding down going on, or because the rating gaps change at different ranks, maybe with larger handicaps, if one was doing a rating shift you could be shifting by too much or too little rating - I haven’t really sketched all that out, but I think I’ve seen it discussed as a possibility elsewhere.
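A minimal sketch of that rating-to-rank-and-back logic, to make the rounding concern concrete. Everything here is hypothetical: the mapping `rating = 525 * exp(index / 23.15)` is inferred from the rating table earlier in the thread, each handicap stone is treated as exactly one rank, and the function names are invented. It is not a description of what OGS actually does.

```python
import math

A, C = 525.0, 23.15  # assumed mapping constants


def rating_to_rank_index(rating: float) -> float:
    return C * math.log(rating / A)


def rank_index_to_rating(index: float) -> float:
    return A * math.exp(index / C)


def effective_rating(rating: float, rank_shift: float) -> float:
    """Convert to a rank index, shift it by rank_shift, convert back to a rating."""
    return rank_index_to_rating(rating_to_rank_index(rating) + rank_shift)


black = rank_index_to_rating(20.4)  # slightly stronger than 10k
white = rank_index_to_rating(24.0)  # 6k
stones = round(24.0 - 20.4)         # true gap is 3.6 ranks, rounded to 4 stones

# Hypothetical rule: one handicap stone is worth exactly one rank.
black_effective = effective_rating(black, +stones)
print(f"white {white:.1f}, black effective {black_effective:.1f}")
```

In this made-up case the 3.6-rank gap is rounded up to 4 stones, so Black’s effective rating ends up above White’s rating. If roundings like this tended to go one way more often than the other, that could be one source of drift, but whether that happens in practice depends on the real implementation.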


In terms of deflation it has to be with respect to certain reference points, like comparing a player to themselves a number of years ago and being sure they’re playing at at least the same strength or better, but their ranking has gone down.

Or since we have the alignment with EGF/AGA, if a typical 1d from either association tends now to be closer to 1,2,3k etc, one would notice deflation.

It might come about if for example on average, you’re not getting as many points as you should for a win, or losing too much for a lost game. One would expect that for two players with equal skill that ratings would roughly oscillate around their actual level, kind of like coin flips, sometimes you get a streak and go up, but it’s harder to maintain that streak and there’s a correction of a harder penalty which brings you back down etc.

But if things aren’t well balanced and you’re rewarded a bit too much for wins, or not enough for losses, one can imagine a drift in ratings over time.

So I can imagine that if say one is over rewarded or under penalised in taking handicap in a game, compared to your expected winrate on average, your rating could drift up. Or the opposite, if you are over penalised for losing while giving handicap, or under rewarded for winning on average, over time your rating could drift down.

And even if you don’t play handicap games, if other people at your level do and tend to drift down, it could affect you.

(I’m focusing on handicap since that’s the topic, there could be other ways of having inflation/deflation, but just trying to sketch out a particular related direction)

I don’t quite know the win probability for two players under Glicko2. I’ve seen formulas for Glicko, but there are a few changes, like a scale change and the volatility, and I don’t know how they factor in.
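For reference, here is a minimal Python sketch of the expected-score function as it appears in Glickman’s published description of Glicko-2. Ratings and deviations are first divided by the scale factor 173.7178, the opponent’s deviation dampens the rating difference through `g`, and the volatility does not appear in this formula at all (it only affects how a player’s deviation changes between rating periods). Whether OGS uses exactly this form is an assumption.

```python
import math

SCALE = 173.7178  # Glicko-2 scale factor (equals 400 / ln 10)


def g(phi: float) -> float:
    """Dampening factor for an opponent deviation phi (on the Glicko-2 scale)."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score(rating: float, opp_rating: float, opp_deviation: float) -> float:
    """Expected score of a player against one opponent, from the player's point of view."""
    mu = (rating - 1500.0) / SCALE
    mu_j = (opp_rating - 1500.0) / SCALE
    phi_j = opp_deviation / SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


print(round(expected_score(1600, 1500, 0), 4))    # no uncertainty: plain Elo-style curve
print(round(expected_score(1600, 1500, 350), 4))  # very uncertain opponent: closer to 0.5
```

With an opponent deviation of 0 this reduces to the familiar Elo curve `1 / (1 + 10 ** (-diff / 400))`; larger deviations pull the expectation towards 50%.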


Just to make it explicit, my post you responded to was about the log scale, but I follow your move from the “log scale” topic back to the “handicap” topic.

I see your point. I think you are correct that if the current handicap system overvalues “handicap 1”, then Black with handicap 1 gets too high an expected win chance, so their losses are over-penalised.

This would lead to people who play Black more often not leveling up as fast as they should.

My first thought was: this should even out, because there are as many players playing White as there are playing Black.

But in reality the pool of Black handicap-takers is on average lower in rating/rank than the pool of White handicap-givers. So in effect rating points flow from weaker players to stronger players more easily than the other way around.

This effect would perhaps be amplified by bots, but those games are now unranked, so we can ignore any potential bot effect.

The way to be certain that this occurs, and how much effect it has, is to take the general set of handicap games, and compare the expected winrate of black with the actual winrate of black.
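As a rough idea of what that check could look like, here is a Python sketch. The record layout (Black’s effective rating after the handicap adjustment, White’s rating, and whether Black won) is hypothetical, the Elo-style expectation ignores rating deviations, and the three sample records are invented purely to show the call.

```python
def expected_black_score(black_rating: float, white_rating: float) -> float:
    """Elo-style expected score for Black (deviations ignored for simplicity)."""
    return 1.0 / (1.0 + 10 ** ((white_rating - black_rating) / 400.0))


def compare_expected_to_actual(games) -> tuple[float, float]:
    """games: iterable of (black_effective_rating, white_rating, black_won)."""
    games = list(games)
    expected = sum(expected_black_score(b, w) for b, w, _ in games) / len(games)
    actual = sum(1 for _, _, won in games if won) / len(games)
    return expected, actual


# Invented records, only to demonstrate the function.
sample = [
    (1500.0, 1500.0, True),
    (1500.0, 1500.0, False),
    (1400.0, 1500.0, False),
]
expected, actual = compare_expected_to_actual(sample)
print(f"expected {expected:.3f}, actual {actual:.3f}")
```

On a large set of real handicap games, an actual Black win rate clearly below the expected one would suggest the handicap is being overvalued, and the reverse would suggest it is undervalued. Splitting the comparison by handicap size and by rank band would show where any bias sits.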


I think there is an argument that all handicap matches over-reward/over-penalise to a slight extent. Take a 1-stone handicap match, as this doesn’t involve any extra stones on the board (I think the idea extends to more stones). When the game ends with White up to komi points behind on the board, the result goes to Black in the handicap game (Black’s rating goes up, White’s goes down), while if the same game is played with no handicap, exactly the same game goes to White (White’s rating goes up, Black’s goes down). Thinking about these results as evidence of what Glicko2 should estimate a player’s rating to be, both games seem to indicate that the Black player is about 1 stone weaker than the White player, i.e. both players are playing at their rank according to how they placed their stones on the board. Since the Glicko2 estimate of the Black player’s rank increases as a result of the handicap game, however, there must be a slight over-reward accruing to the Black player here (and a symmetric over-penalty for the White player).

The same idea extends to more stones by implying the Black winner at handicap is up to the handicap stones weaker than the White player when they win.

Glicko2 by itself has no option for handicap stones. It just has ratings and other values. For handicap games, OGS calculates “effective ratings” which are supposedly a fair representation.

I think you assume that OGS treats “handicap 1: Komi 0.5” the same as “one stone / one rank” strength difference. But that is not the case. Handicap 1 is counted as a 0.46 rank difference on 19x19 with territory scoring, and 0.54 rank difference under area scoring.

Whether that is good or too little / too much can be debated of course (with analysis).


Agree with all the details of that.

What I am describing probably doesn’t have a fix in rated handicap.

What I am saying is that Glicko2 wants to increase White’s rating and decrease Black’s rating at even, so it must be over-rewarding/over-punishing to decrease White’s rating and increase Black’s for the exact same game at 1 handicap.

Just want to point out that historically, the “handicap stones” based teai system was “winrate-based” and wasn’t linked to the ranking system. Ancient sources from China described different “pin” of ranks based on the quality of their plays, not the winrate.

And the dan ranking system, developed in the 16th to 17th century, right before the Japanese Edo period, was aimed at solving a very similar issue. They already had a 3-tiered system of upper/middle/lower strength players (think of them as roughly 7-9, 6-4 and 3-1 dan, where you had to absolutely beat those below you at 2 handicaps to be recognized as “next tier”). They also didn’t know how to rate a player who got inconsistent winrates at different handicaps against a variety of players who weren’t organized into “ranks”. The more detailed dan ranking system didn’t just “map” to different “winrates with handicaps”. A player didn’t automatically advance to the next rank after a series of decisive games; the ranks were quantized and capped at the bottom by design. Historical records show that when a player accumulated enough winrate with handicaps, they got “recognized” as being of a certain strength, but had to wait for the approval of the Great Houses (not just for the politics) before actually getting their rank. And their teai would change even though their rank had not. This probation period, during which their rank hadn’t yet been adjusted, ensured each rank had enough representatives and anchored each rank.

There is a reason why they didn’t demote ranked players. The upper cap of Meijin and semi-Meijin (9 dan, 8 dan) ensured that ranks didn’t just inflate (which happened in the 1960s with the oteai, when lots of 9 dans were created because pros still used winrate-based scoring to advance to the next rank but there was no upper cap on ranking), and the floor of each quantized rank ensured that ranks didn’t deflate. The ranks weren’t just a pretty number on a diploma, but had a real function (in combination with the Great Houses’ higher-level concessions and consensus, which ensured that a group of strong and senior players judged the quality of players).


I interpret the words over rewarding/punishing as: too much. If that is what you mean, then my response is that you can’t say that without statistical analysis.

If you mean over as in “more than without handicap”, then yes. Glicko doesn’t do this, but OGS “hacks” this into the rating system, by letting Glicko think that the player playing Black with handicap has a higher rank than they actually have.

The way systems like Elo and Glicko work is that for a given matchup there’s an “expected value”, the average predicted result. For a game with only wins and losses, like Go, this is just the predicted winning percentage. Say this expected value is 0.60 (Black has a 60% chance of winning). If Black wins (scores 1.00), they get a rating reward proportional to how much they overperformed (0.40 in this case) and White gets a rating penalty proportional to how much they underperformed (also 0.40, since White was expected to score 0.40 and scored 0). Of course there are more subtleties, but this is the heart of it. This expected value depends both on the players’ rating difference and on the parameters of the game, such as komi and handicap.

Say that our stronger player W and weaker player B play a thousand games. Just making some numbers up, let’s say that if they play with 0.5 komi W is expected to win 50% of the time; if they play with 6.5 komi W is expected to win 60% of the time. So for 10% of their games the result switches whether they use komi or not. Glicko understands this; the expected values for the two kinds of matchup are different.

If the games were played without komi, Black would win 500 of the games (overperforming by 0.50) and lose 500 (underperforming by 0.50), so it would all even out and their rating wouldn’t change.

If the games were played with komi, Black would win 400 of the games (overperforming by 0.60) and lose 600 (underperforming by 0.40), so it all evens out again.

The analogous calculations work for White.
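The bookkeeping in that example can be checked with a few lines of Python. This is only a sketch of the “surprise” term (actual score minus expected score) summed over many games; real Elo and Glicko updates multiply it by a factor that depends on the system, and the function name is made up.

```python
def net_surprise(expected: float, wins: int, losses: int) -> float:
    """Sum of (actual - expected) over all games from one player's point of view."""
    return wins * (1.0 - expected) + losses * (0.0 - expected)


# Black with komi: expected 0.40, and actually wins 400 games out of 1000.
balanced = net_surprise(0.40, wins=400, losses=600)

# Same results, but the system wrongly expects Black to win 45% of the time.
miscalibrated = net_surprise(0.45, wins=400, losses=600)

print(round(balanced, 6), round(miscalibrated, 6))
```

When the expectation matches the real win rate the total is zero and the rating just oscillates. If the system expects 45% but Black really wins 40%, the total comes out at about -50, so Black’s rating drifts down over time, which is the kind of mismatch the handicap discussion above is worried about.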

[Edited because I rushed the first time and got some of the numbers backwards]


In the original teai-wari (using match sequences for finer match handicap gaps), they switch from “even” (take turns to play white and black, tagai-sen 互先) to B-W-B B-W-B (sen-ai-sen, 先互先, effectively 2 black 1 white), then one side plays as black (josen, 定先).

So they expected the gap of a 1 dan rank difference to mean winning about 1/3 of the games (33% to 67%). And the math is pretty simple: if you play as Black without komi and only expect to win 50/50, then playing B-W-B, where you have very little chance to win as White, leads to an expected win rate of slightly more than one out of three. You need about a 20% win rate as White to reach a 40% win rate for B-W-B sen-ai-sen (with a 50% winning chance as Black). And a player with 20/80 as White and 50/50 as Black, playing even matches as tagai-sen B-W, would expect to win 35% of them ((0.2+0.5)/2).

From historical records, though, we see that in games among players of roughly equal strength without komi, Black had a win rate higher than 80% (somewhere around 5 in 6 to 6 in 7). This 1 to 2 ratio basically defined the jubango format and its predecessor, the 20-game challenge, where one side had to win 4 more games to “rank up” (change teai from even to B-W-B, then from B-W-B to just B). In jubango it would be 7-3 (a 30% win rate), and for 20 games it would be 12-8, that is 40%. However, since the teai would change as soon as one side accumulated 4 straight wins, the expected winning percentage is definitely higher for a 1-rank difference, likely close to the jubango 30%, and almost exactly 1/3.
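The arithmetic for these colour sequences is just an average of per-colour win rates, as in this small Python sketch (the function name and the string encoding of the sequence are illustrative choices):

```python
def sequence_win_rate(p_black: float, p_white: float, colours: str) -> float:
    """Average win rate over a repeating colour sequence such as 'BWB' or 'BW'."""
    total = sum(p_black if c == "B" else p_white for c in colours)
    return total / len(colours)


print(round(sequence_win_rate(0.5, 0.0, "BWB"), 3))  # no chance as White
print(round(sequence_win_rate(0.5, 0.2, "BWB"), 3))  # 20% as White, sen-ai-sen
print(round(sequence_win_rate(0.5, 0.2, "BW"), 3))   # 20% as White, tagai-sen
```

These are the three cases mentioned above: one in three with no chance as White, 40% in B-W-B with a 20% win rate as White, and 35% when alternating colours.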


From above,

“Handicap 1 is counted as a 0.46 rank difference on 19x19 with territory scoring, and 0.54 rank difference under area scoring.”

So White is still expected to win more than 50% against Black even with 1 stone handicap.

Basically, playing even with komi, accepting a 1-stone handicap, and giving a 1-stone handicap are 3 slightly different Go variants, and if you don’t separate them in the Glicko2 analysis, then the Elo/Glicko2 system must be slightly off when rating them all together.

Another implication of OGS using a log mapping from rating to rank is that the expected win percentage goes up across each rank band. I’ve calculated the win percentage the stronger player needs in order to maintain their rating against a player 1 rank lower (say, in a many-game match), based on the Elo system (which I believe is the same as Glicko2 for these purposes).

Read the table as: 24k against 25k needs a 54.73% win rate to maintain a 1-rank gap in ratings.

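| Rank | Win rate |
|---|---|
| 25k | |
| 24k | 54.73% |
| 23k | 54.73% |
| 22k | 55.02% |
| 21k | 55.16% |
| 20k | 55.30% |
| 19k | 55.45% |
| 18k | 55.59% |
| 17k | 55.87% |
| 16k | 55.87% |
| 15k | 56.30% |
| 14k | 56.30% |
| 13k | 56.58% |
| 12k | 56.86% |
| 11k | 57.01% |
| 10k | 57.29% |
| 9k | 57.43% |
| 8k | 57.85% |
| 7k | 57.85% |
| 6k | 58.27% |
| 5k | 58.55% |
| 4k | 58.69% |
| 3k | 59.11% |
| 2k | 59.25% |
| 1k | 59.66% |
| 1d | 59.94% |
| 2d | 60.22% |
| 3d | 60.63% |
| 4d | 60.90% |
| 5d | 61.18% |
| 6d | 61.59% |
| 7d | 61.99% |
| 8d | 62.27% |
| 9d | 62.80% |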
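A calculation of this kind can be sketched in Python as follows. The exact figures depend on which rank-to-rating mapping and rounding are used: this sketch assumes `rating = 525 * exp(index / 23.15)` (inferred from the rating table earlier in the thread) and the plain Elo expectation, so its output will not necessarily match the percentages in the table above, which may have been derived from a different mapping. The trend of a rising win rate per rank is the point.

```python
import math

A, C = 525.0, 23.15  # assumed mapping constants, not taken from OGS source


def rating_of(index: float) -> float:
    """Rating for a rank index (30k = 0, 1k = 29, 1d = 30, 9d = 38)."""
    return A * math.exp(index / C)


def win_rate_vs_one_rank_lower(index: float) -> float:
    """Elo expected score of a player at `index` against a player one rank below."""
    diff = rating_of(index) - rating_of(index - 1)
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


for name, index in (("24k", 6), ("10k", 20), ("1d", 30), ("9d", 38)):
    print(f"{name}: {100 * win_rate_vs_one_rank_lower(index):.2f}%")
```

Because the rating gap between adjacent ranks grows with rank under any exponential mapping, the required win rate against the next rank down grows as well.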

Hmm, I’m not sure I could actually feel the difference between a 57.85% chance and a 60.22% one.
