2020 Rating and rank tweaks and analysis

I explained the format up above in a previous post, but it’s handicap 0, 1, 2, etc., with the format Actual black win rate % : Expected black win rate % [n = number of games]

3 Likes

“I think the EGF stats are correct, and the OGS ones are wrong. White is expected to win more in traditional handicap games because they are on average 0.5 stones less than what would create a 50/50 game.”

I think you might not be quite understanding the statistics. Either that, or I’m wrong about how they are collected. My assumption is that the collected data is not from games at H-0.5 but from games where the handicap matches the rank difference (and therefore at H-0). On the basis that a handicap stone is equivalent to a single rank at all ranks, one would then expect a 50/50 win rate. That assumption probably only holds if the GoR/Glicko rating is not a linear X points per rank across all playing strengths, as there seems to be a general belief among stronger players that 3 stones at dan ranks is more significant than 3 stones at DDK ranks. FWIW, ranks in the EGF are basically 100 GoR points = 1 rank, which means I suspect the handicap calculations are likely to be off at least at one end of the scale, and possibly both.

Maybe this does create a new and interesting question. If the argument is that high strength EGF players only play handicap games at a slight advantage for White (handicap is always half a stone less than “true handicap”), why is this? Is this a philosophical belief that having the stronger player only have a 50% chance of winning is in some way unfair? Does the average player playing on OGS or elsewhere expect that when a handicap game is automatically created, the game should favour the White player slightly? It’s the first I’ve come across this, but maybe this is an institutional bias we have?

My limited discussions with players from an Eastern Go background, and my observations of interviews with them, suggest that handicap stones are primarily used in teaching situations and not in competitive games anyway, on the basis that they make the games more instructive. Some exceptions to this are things like the jubango games, or Korean games-for-money where I’ve heard of handicap stones being treated sort of like a Contract Bridge bid, but those are somewhat special circumstances in both cases.

Obviously whether OGS decides that handicap-0.5 is a more appropriate setting is separate to the issue of “correctly calculating appropriate handicaps by rating difference”, but it does make it increasingly more complex to calculate expected win rates.

Interestingly, and I don’t know of anyone who has done similar work with Go, a statistician named Jeff Sonas did a lot of very compelling work over time demonstrating that, in Chess, the default expected win rates according to ELO calculations were actually very suspect. This was particularly true at the higher ratings, where he had the most data and did most of the work; he found that expected win rates were much closer to linear than to the ELO distribution. www.chessmetrics.com still has quite a lot of his work, but I can’t find a few of the detailed distribution articles he published there.

3 Likes

My teachers are all dans in real life, as are some players in my local go club.

1 Like

Some (but not all) EGF tournaments that use handicaps (and not all of them even do that) use MMS-1 (McMahon score difference minus 1) to determine the handicap, meaning that you often see a 5k vs a 7k playing with just the reduced komi of 0.5 and no extra stones. Or you see a 5k with 0 wins playing an even game against an 8k with 2 wins in round 3 of a tournament.

To make it even more confusing, some tournaments have a limit on where handicaps are applied. For example, dans play even games while kyus use handicaps based on McMahon points.

Also, sometimes a tournament doesn’t have enough DDKs to give the very lowest ranked players enough opponents within 9 stones of them. So you get 9-handicap games between players who are a lot further apart than that (and this happens on OGS too: tournaments with random/slide/slaughter pairing create ranked games between players over 9 stones apart).

The point is, be very careful when analysing results from handicapped tournaments.

6 Likes

I just got upranked after a loss. Personally, I don’t mind losing rank after a win (it’s just Glicko working as intended), but it would be more consistent if it went the other way as well.

2 Likes

You can correct for half a stone and the EGF does this.

For example, say you’re a 7d EGF playing against a 6d EGF on even. The difference between those players is 100 GoR, which is about 200 Elo in that range, which means the 7d is expected to win 75% of the games.

When the 6d gets black without komi, it only compensates for half a rank. So the difference is still 50 GoR, which is about 100 Elo in that range, which means the 7d is still expected to win 65% of the games.

Another example: a 19k plays against a 20k. On even, the difference between these players is 100 GoR, which is about 35 Elo in that range, which means the 19k is expected to win 55% of the games.

When the 20k gets black without komi, it compensates for only half a rank. So the difference is still 50 GoR, which is about 17.5 Elo in that range, which means the 19k is still expected to win 52.5% of the games.

Same again but now for 2d vs 1d (100 GoR is about 100 Elo in that range):
even game (100 Elo) => 65%
no komi (50 Elo) => 57%
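
The arithmetic in these examples can be reproduced with a short sketch. It assumes the standard logistic Elo expectation (the post does not say which formula it used), and the Elo-per-100-GoR factors are simply the approximate values quoted above; `expected_score` and `win_rates` are illustrative names. The computed figures land within about a point of the rounded percentages in the post.

```python
def expected_score(elo_diff: float) -> float:
    """Expected win probability for the stronger player (logistic Elo curve)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Approximate Elo equivalent of 100 GoR (one rank), as quoted in the post.
ELO_PER_RANK = {"7d vs 6d": 200.0, "2d vs 1d": 100.0, "19k vs 20k": 35.0}


def win_rates(elo_per_rank: float) -> tuple[float, float]:
    """Return (even game, no-komi game) win rates for the stronger player.

    Assumption from the post: giving Black no komi compensates half a rank,
    so half of the one-rank rating gap remains.
    """
    return expected_score(elo_per_rank), expected_score(elo_per_rank / 2.0)


for pairing, elo in ELO_PER_RANK.items():
    even, no_komi = win_rates(elo)
    print(f"{pairing}: even {even:.1%}, no komi {no_komi:.1%}")
```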

1 Like

“You can correct for half a stone and the EGF does this.”

There are many factors that give cause to question the accuracy of any correction. Firstly, it depends on the formula, hence the second half of my previous post. This is then complicated by the fact that correction per hundred points in rating difference is static in ELO (it’s more complicated with Glicko2), and that rank difference is the same as rating difference across the rating distribution. 100 points equates to 1 rank more or less across the spectrum on the EGD (at least 20k to 7d), whereas on OGS it doesn’t equate to 1 rank across the rating distribution. It also goes on the basis that half a handicap stone is the equivalent of 50 GoR regardless of the strength of players - even based on the premise that the lack of linearity for ranks on OGS is designed to make “1 rank = 1 stone”, there is now still a very legitimate question on whether “if 1 stone = 100 GoR at EGF 2300 and 1 stone = 60 rating points at OGS 1500 accurately handicaps a 1 rank difference, does 4 stones = 400 GoR at EGF 2300 and 4 stones = 240 rating points at OGS 1500 still hold true for players with 4 ranks difference?”

Essentially of course you’re correct, particularly if you’re fairly strictly following ELO as a gold standard, it’s actually very easy to model corrections, and this is exactly why I used ELO for OGS at the beginning. It’s a rather crude instrument as far as predictive algorithms go though, and if we’re discussing the “correct” value of rating points per handicap stone at different ranks, and the accuracy of the predictiveness of both differing strength players in an even game and the impact of handicap stones on the outcome at different points in the rating spectrum, accurate correction becomes anything but straightforward.

Just to make things more complicated, there’s also now a very subjective side discussion on whether a “correctly handicapped” game should aim for 50/50, or whether it should be favouring the white side. There’s a lot of questions here that can only be answered by “what OGS want to do with their system” rather than have an underlying objective truth.

So that I don’t sound like I’m just being a troublemaker and poking holes in everything, I would like to say that the effort the OGS team are putting into addressing this is considerably more than I have ever seen anywhere else. I suspect that whatever solution is chosen will be more than good enough regardless. I was happy for many years playing on servers that just had the far simpler systems of “win = X points” (IGS, now Pandanet, no idea if they’ve retained the system) or “X wins over Y games = promotion” (one of the Korean servers, can’t remember which) and didn’t seem to suffer as a result of it.

6 Likes

Well, the difference between ranks is 100 GoR (by definition of the EGF rating system). But 100 GoR is not the same as 100 Elo. See this diagram:


I was using the blue line for my previous calculation. The green line is what the EGF uses, but it’s not really aligned very well with the actual data (blue line).

For the EGF the basic assumption is that handicap defines rank differences (after correcting for that half stone advantage for white). The EGF rating anchor is 7d where pro level play is assumed to start.

I suppose that a simpler system will work fine for a while, but the EGF system has been running for decades without adapting the system and the overall inflation/deflation seems to be fairly small. I can imagine that a go server faces extra challenges for maintaining ratings, but I think the EGF system has proven its quality for offline tournament games (which doesn’t mean it’s perfect).

2 Likes

The main reason I think OGS should make it so that handicap games favor white by 0.5 stones on average is so their ranks have a better chance to be similar/comparable to others. EGF, AGA, KGS, and IGS all use this assumption. So if OGS does not use that same assumption, the definition of ranks is different, and will probably lead to differences.

There is a good theory for why it’s done this way. The difference between H1 and H2 is 1 full stone. Same for H2 to H3 etc. It’s as if black plays a move, white passes, then black plays again. But the difference between H0 and H1 is not 1 full stone, it’s 0.5 stones. In a H0 game, Black pays 6.5 komi. To make a full stone difference, it would be like Black passes, and White moves first instead. In that case though, White is moving first, but still receives 6.5 komi. According to this logic, a true 1 full stone handicap game would be where Black moves first and receives 6.5 komi (-6.5 komi). Since the traditional system makes a H1 game 0.5 komi, that is only half the points compensation of a theoretical 1 stone difference.

It’s true in real life we don’t know if these theories match reality. But IMO OGS should not create their own new theory, because that will make their system different from most other systems out there now.

3 Likes

I didn’t address this part, so let me try to clarify what I mean. The way the handicap for a game is picked varies with the situation, but let’s assume full handicaps are picked based on ranks. Then a 5d vs a 2d will play a H3 game. Black will place 3 stones, and give a komi of only 0.5 points. But for the purposes of rating the game, the real compensation is only 2.5 ranks for Black, which makes White the favorite instead of giving a true 50/50 game. In casual English we of course still call this a “3 stone handicap game”, and the EGF stats of H3 games are for this situation. But when doing ratings you need to be more precise: this “H3” game only compensates Black 2.5 ranks, not 3.
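
As a minimal sketch of that bookkeeping, the half-stone shortfall can be written as a small function. The function names are hypothetical, and the rule "a traditional handicap of H stones compensates H - 0.5 ranks" is just the convention described in this post, not code from any rating system.

```python
def effective_compensation(handicap_stones: int) -> float:
    """Ranks actually compensated by a traditional handicap.

    H0 is an even game with full komi (no compensation); H1 means Black
    moves first with 0.5 komi (half a rank); each further stone adds a
    full rank, so H stones compensate H - 0.5 ranks.
    """
    if handicap_stones < 0:
        raise ValueError("handicap cannot be negative")
    if handicap_stones == 0:
        return 0.0
    return handicap_stones - 0.5


def residual_rank_gap(rank_difference: float, handicap_stones: int) -> float:
    """Rank advantage White still holds after the handicap is applied."""
    return rank_difference - effective_compensation(handicap_stones)


# 5d vs 2d at H3: Black is compensated 2.5 ranks, White keeps half a rank.
print(effective_compensation(3), residual_rank_gap(3, 3))
```

For rating purposes the residual gap, rather than zero, is the rating difference to feed into the expected-result calculation.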

3 Likes

I suppose there are a few fundamental running discussions now:

  1. Should 1 rank between players be designed to always be the equivalent of a one stone handicap? I.e. does “correct handicap” define rank? This is an optional assumption. It comes with the advantage that handicapping is fundamentally very easy, and the disadvantage that rank progress feels inconsistent for an improving player. Whether it should be consistent is of course the question itself. The EGF uses this relationship between rank and handicap as a basic assumption. Familiarity is obviously another advantage for those playing from Europe. I don’t know whether it’s just as much the case in the US and the AGA for example.

  2. Familiarity / clarity or precision? When this site started, you had two handicap choices. One was designed traditionally, with 0.5 komi for 1 stone (Black moves first), 2 stones and 0.5 komi for 2 stones (White moves first), etc. The other was a “true handicap” option, which would move the komi itself between 6.5 and -6.5 as well as using stones to determine the accurate handicap. The idea was to offer extra precision, but very few people wanted it - traditional handicaps were traditional, people “got” them, and didn’t mind the possibility that the game was ever so slightly biased in one player’s favour. OGS has to decide where to draw the balance between familiarity and precision. We’ve already seen people’s confusion with the issues around averaged rating adjustments leading to losing rating points with a win.

  3. How to keep separate the “connected but different” issues of “handicap stone values in terms of rating points (and rank, although depending on the decision in 1) this is the same thing)”, and “corrections / adjustments for the result for a handicap game”. Do the rating adjustments after a handicap game have a smaller impact compared to an even game … on rating? on volatility?

  4. The ongoing debate on how expected results are affected by differing time controls has an impact on the effective value of the handicap stones too. Shorter time controls will change the “appropriate” handicap between two players of different strengths.

(5) And separately to all of this, and mostly connected to the other thread: how to handle decisions about extending the rank boundaries beyond traditional ranks. Should they be based largely on consistency in being able to give X handicap stones between players of those ranks, or on wanting players at those ends of the playerbase to be able to see their rank respond to their results?

4 Likes

For those interested, this is a relatively interesting paper published on how Go engines used adjustable komi to steadily adjust for handicap stones over the course of a game to stop engines playing nonsense in high handicap games: http://www.pasky.or.cz/go/dynkomi.pdf

FWIW, their default value for seeding their engines to deal with overcoming a handicap by not seeing themselves as hopelessly behind was 7 points per handicap stone, reducing linearly down to 0 at move 200. So even though logic dictates that the difference between 6.5 and 0.5 komi is really only “half a stone”, it’s interesting to see that Monte Carlo engine developers of around 10 years ago felt that it was the equivalent of a full stone in terms of “balancing” its play. Infer from it what you will :)
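
To make the seeding rule concrete, here is a rough sketch based only on the description above (7 points per handicap stone, decaying linearly to 0 at move 200). The function name and parameters are hypothetical, and the sign convention and how an engine would actually apply the value are not taken from the paper.

```python
def seeded_komi_offset(handicap_stones: int, move_number: int,
                       points_per_stone: float = 7.0,
                       ramp_end_move: int = 200) -> float:
    """Extra komi credited for the handicap at a given move.

    Starts at points_per_stone * handicap_stones on move 0 and falls
    linearly to 0 at ramp_end_move, staying at 0 afterwards.
    """
    remaining = max(0.0, 1.0 - move_number / ramp_end_move)
    return points_per_stone * handicap_stones * remaining


# A 4-stone game: full offset at the start, half at move 100, none from move 200.
for move in (0, 100, 200, 250):
    print(move, seeded_komi_offset(4, move))
```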

3 Likes

I would consider that strange, but anyway we now have a much better way of evaluating the value of different handicaps: KataGo.

Handicap  Estimated  Ideal
1           6          7
2          20         21
3          34         35
4          48         49
5          60         63
6          74         77
7          87         91
8         103        105
9         115        119
10        127        133
11        139        147
12        153        161
13        170        175
14        186        189
15        206        203
16        219        217
17        222        232

This is quite consistent with each handicap stone being worth 14 points, except for the 1st one, which is worth only 7 points. It supports the idea that traditional handicaps are half a stone short of full compensation for the rank difference.
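
A quick way to check that reading of the table is sketched below. The Estimated values are copied from the table; the formula 14 * H - 7 for the Ideal column is inferred from the description rather than stated in the post. It reproduces every Ideal entry except the last, where the table lists 232 and the formula gives 231.

```python
# KataGo point estimates per handicap, copied from the table in the post.
ESTIMATED = {
    1: 6, 2: 20, 3: 34, 4: 48, 5: 60, 6: 74, 7: 87, 8: 103, 9: 115,
    10: 127, 11: 139, 12: 153, 13: 170, 14: 186, 15: 206, 16: 219, 17: 222,
}


def ideal_points(handicap: int) -> int:
    """Assumed model: 14 points per stone, but only 7 for the first one."""
    return 14 * handicap - 7


# Estimated minus ideal, per handicap.
gaps = {h: est - ideal_points(h) for h, est in ESTIMATED.items()}

# Marginal value of each extra stone from H2 to H17, and its average.
steps = [ESTIMATED[h] - ESTIMATED[h - 1] for h in range(2, 18)]
mean_step = sum(steps) / len(steps)

for h, est in ESTIMATED.items():
    print(f"H{h:<2} estimated {est:>3}  ideal {ideal_points(h):>3}  gap {gaps[h]:+d}")
print(f"mean value of stones 2..17: {mean_step} points")
```

On these numbers the average extra stone is worth 13.5 points, against roughly 6 for the first one.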

Copied from a post of mine in the L19 forum: https://lifein19x19.com/viewtopic.php?p=253248#p253248

6 Likes

Hello, thanks for sharing the details of the changes and the related analysis @anoek. Are you sure that it is now working as intended? I just won a game (https://online-go.com/game/24560881) and my rank went down from 6.1k to 6.2k, when I would have expected it to go up considering your explanation of the update.

2 Likes

@alemitrani It’s still possible, I just updated the text up above as to why… basically if your rating 15 games ago went down, you’ll be starting at a lower rating when computing your updated rating.

6 Likes

ok, thanks for explaining that.

3 Likes

You can read the paper on Prof. Glickman’s website. Ignoring rating deviation, the equation for expected win probability is 1/(1+10**(-d/400)), where d is the difference in ratings. So given a 100 point rating advantage, you’d expect a 64.0% winrate.
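
Written out, the calculation for a 100 point advantage looks like this (rating deviation ignored, as above):

```latex
E(d) = \frac{1}{1 + 10^{-d/400}}, \qquad
E(100) = \frac{1}{1 + 10^{-0.25}} \approx \frac{1}{1 + 0.5623} \approx 0.640
```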

2 Likes

I still believe that none of this is based on math. With math we can merely describe the behavior of the higher entity. Just as OGS has a soul that directs its course, the ranking system has a mind of its own. And They decide how to adjust a player’s ranking. It’s much easier to think this way.

Praise anoek, the Great Inventor! Praise Glicko, the God of Ranks!

10 Likes

This doesn’t fix the main problem with OGS rating system, which is that it ranks me 2 stones weaker than every other server does, making me not want to play on here.

Change the ratings of everyone so that they are 1 stone higher than those people are on every other server and watch your playerbase increase.

1 Like

Thanks, so the rating “unit” is the same as Elo (at least the USCF version, but FIDE is very close).

2 Likes