2020 Rating and rank tweaks and analysis

I didn’t address this part, so let me try to clarify what I mean. The way the handicap for a game is picked varies by situation, but let’s assume we’re talking about a situation where full handicaps are picked based on ranks. Then a 5d vs a 2d will play an H3 game: Black places 3 stones and gives a komi of only 0.5 points. But for rating purposes, the real compensation for Black is only 2.5 ranks, which makes White the favorite instead of it being a true 50/50 game. In casual English we of course still call this a “3 stone handicap game”, and the EGF stats for H3 games cover this situation. But when computing ratings you need to be more precise, and this “H3” game only compensates Black 2.5 ranks, not 3.
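To make the arithmetic explicit, here’s a minimal sketch of the rank compensation of a traditional 0.5-komi handicap, under the post’s assumption that moving first is worth about half a stone (the function name is mine, for illustration):

```python
def effective_rank_compensation(handicap_stones: int) -> float:
    """Rank compensation of a traditional handicap game (0.5 komi).

    Assumption (from the post): an even game with full komi is the true
    50/50 baseline, and moving first is worth about half a stone. With
    only 0.5 komi, Black's first-move advantage already covers half a
    stone, so H stones compensate only H - 0.5 ranks.
    """
    if handicap_stones <= 1:
        return 0.5  # "H1": Black moves first, komi reduced to 0.5
    return handicap_stones - 0.5

# A 5d vs 2d "H3" game compensates only 2.5 of the 3-rank gap:
# effective_rank_compensation(3) -> 2.5
```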

3 Likes

I suppose there are a few fundamental running discussions now:

  1. Should 1 rank between players always be designed to be the equivalent of a one-stone handicap? I.e. does “correct handicap” define rank? This is an optional assumption. It comes with the advantage that handicapping is fundamentally very easy, and the disadvantage that rank progress feels inconsistent for an improving player. Whether it should be consistent is of course the question itself. The EGF uses this relationship between rank and handicap as a basic assumption, so familiarity is obviously another advantage for those playing from Europe. I don’t know whether that’s just as much the case in the US and the AGA, for example.

  2. Familiarity / clarity, or precision? When this site started, you had two handicap choices. One was designed traditionally: 0.5 komi for 1 stone (Black moves first), 2 stones and 0.5 komi for 2 stones (White moves first), etc. The other was a “true handicap” option, which would move the komi itself between 6.5 and -6.5, as well as using stones, to determine the accurate handicap. The idea was to offer extra precision, but very few people wanted it - traditional handicaps were traditional, people “got” them, and didn’t mind the possibility that the game was ever so slightly biased in one player’s favour. OGS has to decide where to draw the balance between familiarity and precision. We’ve already seen people’s confusion with the issues around averaged rating adjustments leading to losing rating points with a win.

  3. How to keep separate the connected-but-different issues of “the value of handicap stones in terms of rating points (and rank, although depending on the decision in 1 this is the same thing)” and “corrections / adjustments to the result of a handicap game”. Should the rating adjustments after a handicap game have a smaller impact than those after an even game … on rating? On volatility?

  4. The ongoing debate on how expected results are affected by differing time controls has an impact on the effective value of the handicap stones too. Shorter time controls will change the “appropriate” handicap between two players of different strengths.

  5. And separately to all of this, and mostly connected to the other thread: how to handle decisions about extending the rank boundaries beyond the traditional ranks. Should they be based largely on the consistency of being able to give X handicap stones between players of those ranks, or on wanting players at those ends of the playerbase to be able to see their rank respond to their results?

4 Likes

For those interested, this is a relatively interesting paper on how Go engines used adjustable komi to steadily adjust for handicap stones over the course of a game, to stop engines playing nonsense in high-handicap games: http://www.pasky.or.cz/go/dynkomi.pdf

FWIW, their default value for seeding their engines to overcome a handicap without seeing themselves as hopelessly behind was 7 points per handicap stone, reducing linearly down to 0 at move 200. So even though logic dictates that the difference between 6.5 and 0.5 komi is really only “half a stone”, it’s interesting that Monte Carlo engine developers of around 10 years ago felt it was the equivalent of a full stone in terms of “balancing” the engine’s play. Infer from it what you will :slight_smile:
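That default schedule is easy to write down. This is a sketch of a plain linear fade, one of the variants the paper describes (parameter names are mine):

```python
def dynamic_komi(handicap_stones: int, move_number: int,
                 points_per_stone: float = 7.0,
                 fade_out_move: int = 200) -> float:
    """Linear dynamic komi: start at points_per_stone * handicap_stones
    and decay linearly to 0 at fade_out_move, so the engine stops
    seeing itself as hopelessly behind early in a handicap game."""
    if move_number >= fade_out_move:
        return 0.0
    start = points_per_stone * handicap_stones
    return start * (1 - move_number / fade_out_move)

# With the paper's defaults, a 9-stone game starts at 63 extra komi
# points and reaches 0 at move 200:
# dynamic_komi(9, 0) -> 63.0, dynamic_komi(9, 100) -> 31.5
```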

3 Likes

I would consider that strange, but anyway we now have a much better way of evaluating the value of different handicaps: KataGo.

Handicap  Estimated  Ideal
1           6          7
2          20         21
3          34         35
4          48         49
5          60         63
6          74         77
7          87         91
8         103        105
9         115        119
10        127        133
11        139        147
12        153        161
13        170        175
14        186        189
15        206        203
16        219        217
17        222        232

This is quite consistent with each handicap stone being worth 14 points, except for the 1st one, which is worth only 7 points. It supports the idea that traditional handicaps fall half a stone short of fully compensating the rank difference.

Copied from a post of mine in the L19 forum: https://lifein19x19.com/viewtopic.php?p=253248#p253248
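As a quick sanity check on the table above: the “Ideal” column follows a simple linear model (7 points for the first stone, 14 for each additional one, i.e. 14n − 7), and the KataGo estimates stay reasonably close to it:

```python
# "Estimated" point values from the KataGo table above, handicaps 1..17.
estimated = [6, 20, 34, 48, 60, 74, 87, 103, 115, 127,
             139, 153, 170, 186, 206, 219, 222]

def ideal(n: int) -> int:
    # 7 points for the first stone (the free move), 14 for each further one
    return 14 * n - 7

# How far each KataGo estimate deviates from the linear model
residuals = [estimated[n - 1] - ideal(n) for n in range(1, 18)]
```

The residuals grow somewhat in the middle of the range (KataGo values a mid-size handicap slightly less than the linear model), but the overall fit supports the 14-points-per-stone reading.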

6 Likes

Hello, thanks for sharing the details of the changes and the related analysis @anoek. Are you sure that it is now working as intended? I just won a game (https://online-go.com/game/24560881) and my rank went down from 6.1k to 6.2k, when I would have expected it to go up considering your explanation of the update.

2 Likes

@alemitrani It’s still possible, and I just updated the text up above to explain why: basically, if your rating 15 games ago went down, you’ll be starting from a lower rating when computing your updated rating.
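To illustrate the mechanism, here is a toy Elo-style sketch (not OGS’s actual Glicko-2 implementation; the update rule and all numbers are invented for illustration). The displayed rating is recomputed from the stored rating you held 15 games ago, so if that anchor dropped, the recomputation can land below your previous displayed rating even though the newest game in the window is a win:

```python
def expected(diff: float) -> float:
    # Standard logistic expected-score curve
    return 1 / (1 + 10 ** (-diff / 400))

def batch_update(anchor: float, games, k: float = 32) -> float:
    """Recompute a rating from `anchor` over a window of
    (opponent_rating, score) pairs, treated as one batch."""
    return anchor + k * sum(s - expected(anchor - opp) for opp, s in games)

# 15-game window: 8 wins and 7 losses vs 1500-rated opponents,
# computed from a stored anchor of 1500.
window = [(1500, 1)] * 8 + [(1500, 0)] * 7
old_rating = batch_update(1500, window)          # 1516.0

# Win a new game; the window slides, dropping the oldest game. The
# stored rating from 15 games ago had fallen to 1470, so the whole
# window is recomputed from that lower anchor.
new_window = window[1:] + [(1500, 1)]
new_rating = batch_update(1470, new_window)      # below 1516 despite the win
```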

6 Likes

ok, thanks for explaining that.

3 Likes

You can read the paper on Prof. Glickman’s website. Ignoring rating deviation, the equation for expected win probability is 1/(1+10**(-d/400)), where d is the difference in ratings. So given a 100 point rating advantage, you’d expect a 64.0% winrate.
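For concreteness, that RD-free special case in code (a sketch, not the full Glicko computation, which also weights by rating deviation):

```python
import math

def expected_score(rating_diff: float) -> float:
    """Expected win probability given a rating difference, ignoring
    rating deviation: 1 / (1 + 10^(-d/400))."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A 100-point rating advantage gives about a 64.0% expected winrate:
# round(expected_score(100), 3) -> 0.64
# Equal ratings give exactly 0.5.
```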

2 Likes

I still believe that none of this is based on math. With math we merely describe the behavior of a higher entity. Just as OGS has a soul that directs its course, the ranking system also has a mind of its own, and They decide how to adjust each player’s ranking. It’s much easier to think this way.

Praise anoek, the Great Inventor! Praise Glicko, the God of Ranks!

10 Likes

This doesn’t fix the main problem with the OGS rating system, which is that it ranks me 2 stones weaker than every other server does, making me not want to play on here.

Change the ratings of everyone so that they are 1 stone higher than those people are on every other server and watch your playerbase increase.

1 Like

Thanks, so the rating “unit” is the same as Elo’s (at least the USCF version, though FIDE’s is very close).

2 Likes

Should the other servers do the same then?
How long until we are all 9999 dan?

I think I would stop feeling flattered at some point :wink:

13 Likes

See, told you. A lot of people want to be a dan without doing the work :stuck_out_tongue:

12 Likes

You have convinced me of how good this change is, and I am happy with it. Just for the future, please think about why the combined overall rating performs as well as it does on the current dataset.

Common sense tells me Player A’s 19x19 game results show how strong he is on 19x19. But our new implementation of Glicko2 shows that adding 13x13 and 9x9 results to the data (is the weighting still 0.5 and 0.25 respectively?) is a better indication on average, despite a significant(?) portion of the players playing one size only. Could it be that Glicko is saying the number of samples is too small?

To gennan and EGF members:
A bit of historical background may help you understand what Glicko2 rating system is, and where OGS stands.

When Arpad Elo created a rating system and US Chess Federation adopted it in 1960, the system was based on many assumptions and simplifications. Major assumptions that are important for our purposes are:

  1. Player performance is normally distributed - this was statistically shown to be inadequate when applied to actual USCF win/loss data in the 1960s, and the USCF has since switched to using the logistic distribution, acknowledging that player strength is not scattered totally at random. The World Chess Federation (FIDE) adopted an Elo-based system in 1970 and still uses a modified normal distribution.
  2. The player base is fixed - of course the USCF had new members coming in and some players going out, but at the time the effect on ratings was (deemed?) insignificant.
  3. The mean strength level does not drift - this is often called rating inflation/deflation. The most serious cause of deflation is newcomers entering with ratings at the lowest level while outgoing members are much stronger on average, which causes the mean of all strengths in the pool to drift lower and lower. Modern adaptations of the Elo system often set a rating floor, excluding newbies below that floor from the calculation to lessen this tendency.

Mark Glickman, a statistician and Chair of the USCF Ratings Committee until 2019, analyzed the issues and behavior of the Elo system. He proposed adopting the concepts of confidence level (ratings reliability, or rating deviation, RD - the blue band above and below our rating on the graph on the OGS player profile page) and rating volatility (σ) to come up with the Glicko-2 rating system. It primarily addressed the difficulty, in actual implementations of the Elo system, of setting appropriate K-factors (how sensitively the rating should reflect recent results, for a given strength level), but in general it is meant to be, and is accepted in academic statistics to be, a better and more accurate rating system when implemented well. This is, I believe, where topazg’s comments, including “following Elo as a gold standard”, are coming from.

OGS used a version of Elo rating from its inception, always including 13x13 and 9x9 game results; it used 30k as the floor for years, then switched to Glicko-2 with 25k as the floor. When we made the switch from Elo, there were heated discussions about the EGF rating system and how well it has performed over time. Based on the experience of using an almost identical formula to the EGF’s for years, we generally agreed that Elo’s suitability for the EGF does not translate to online servers, where the number of newcomers is much larger, the number of players leaving the system is larger, the average number of rated games per player is much larger, and a “strong newcomer” is not unusual.

Probably I should have posted such background info much earlier in this discussion, but I didn’t have time for it. I hope this explanation helps you in some ways.

12 Likes

@Tokumoto, actually, originally there was no 30k floor. 30k was -900, and early on it wasn’t even reached, let alone surpassed, because our player base wasn’t big enough. However, at some point we did have a couple of players at 32k and 34k, and we had a couple of complaints from stronger players that these ranks “cannot exist and shouldn’t be represented”, so we artificially blocked the rank display. In hindsight, I feel this was an error in judgement caused by being the new kid on the block in many ways. Fortunately, the 32k and 34k players didn’t complain.

@Alexfrog: I see where you’re going with this. I would just like to declare in advance that I won’t be happy until my rank needs to be expressed with an exponential. I will now need a rank of at least 1x10^5 to feel like a true dan player.

8 Likes

All the jokes I have for this would get me banned.

8 Likes

I’m almost Dan on Tygem, not even close here :smiley:

This is a genuine question, and as this has been an ongoing undertheme of a lot of rank discussions here and elsewhere, it’s still very relevant to this thread:

If you were 2d on Tygem but 2k here, why would it bother you so much to be 2k here? Assuming there was no particular reason you were actually playing worse on OGS, you’re still the same player, still the same strength, and play just as well on both - so why does the rank badge bother you so much?

This isn’t a leading question where I’m about to tell you it shouldn’t bother you, at the end of the day “you do you” etc … if it matters to you that’s fine, I just don’t truly understand why.

As yet another rambling side anecdote: it used to bother me, more than it does now, when a 4d I used to talk to considered themselves a “weak player”. I originally felt it was a kind of false modesty that probably had the effect of being condescending to kyu players. Now I think the feeling was actually genuine. As a rule, learning a small amount of anything makes you think you know more than you do; learning a lot makes you realise just how much there is to learn, and you end up much more aware of the remaining knowledge holes.

As a second question, have you ever tried playing on OGS with ranks disabled for you and all your opponents?

10 Likes

Because all ranks have a certain prestige and weight to them. From a math standpoint it’s all relative and doesn’t matter, but in practice ranks, and certainly dan ranks, are valued. They denote someone’s capability. In Japan dan certificates are sold, I heard Tokumoto say, and they’re sold because there’s some value in them.

So it might be that people already felt good about themselves, about their dan rank on foxy or whatever. And OGS puts them back at the kids table.

Of course, in my personal view, those foxy ranks are ridiculous. Foxy places people who barely know how to play in dan ranks, and it’s not right at all. Dan rank should signify some level of proficiency. So you can’t float your foxy rank in polite society unironically.

10 Likes

I’m not doubting that the Glicko2 algorithm is more suitable for an online rating system than the Elo algorithm. Glicko2 requires more computation, but that’s not a problem in the computer age.

Also, I can totally understand that the rating sensitivity / volatility (K in Elo, con in EGF) has different optimal values for different environments (online play vs real life tournament games, player demographics, etc).

But I’m not talking about those things. I’m talking about the rating-to-rank conversion, which is not part of the Glicko2 system or the Elo system. All the go rating systems use a different shape for this conversion (which in essence assigns different Elo widths to different go ranks). In the EGF system it is determined by the a parameter, and @anoek posted the OGS version. But when analysed, others and I find some quite apparent mismatches between all of these systems and actual statistics. So I am wondering.
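To illustrate what “different Elo widths for different ranks” means: a logarithmic conversion of the shape I understand OGS to use (treat the exact constants 525 and 23.15 as assumptions from memory of anoek’s posts, not as the authoritative formula) makes each rank span more rating points the higher you go, unlike a fixed-width scheme such as the EGF’s:

```python
import math

def ogs_rank(rating: float) -> float:
    """Logarithmic rating-to-rank conversion of the shape OGS uses
    (constants assumed): rank = ln(rating / 525) * 23.15.
    Displayed kyu is then 30 - rank."""
    return math.log(rating / 525) * 23.15

def rank_width(rating: float) -> float:
    """Rating points needed to climb one rank starting from `rating`.
    For a log curve this grows proportionally with rating:
    width = rating * (e^(1/23.15) - 1)."""
    return rating * (math.exp(1 / 23.15) - 1)
```

Under this shape a rank near 1000 rating is roughly 44 points wide, while a rank near 2000 is roughly twice that, which is exactly the kind of difference that makes cross-system rank comparisons (and fits to actual statistics) tricky.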

1 Like