Rating system issues

I meant, this might not be true, but it is how rank difference should be measured, since it gets rid of the arbitrariness of handicap stones (which are frankly a different game being played). In particular, handicap stones are not a good measure of rank difference between very strong players.

Using winrates, the difference between 15k and 13k should be as large as the difference between 5d and 7d.

7 Likes

Ok, but then this would break the relation between your ranks and traditional ranks. It would just be a completely different ranking system.

2 Likes

We’re actually already using this system: it’s the Elo system, and with it the Glicko(2) system.

And the idea that dan ranks are traditionally based on handicap stones is not completely justified either. For instance, with professional players, dan rank does not correspond to handicap stone difference (even if we ignore weird promotion rules based on winning certain titles): the distinctions are too fine to give whole stones. And, as I mentioned above, in Japan dan ranks have become a lot more inflated (meaning the strength difference between consecutive ranks is sometimes less than a full handicap stone), due to dan ranks being sold for money.

4 Likes

Well it isn’t, because OGS converts between Glicko2/Elo ratings and ranks. And the conversion formula is not 150 Elo = 1 rank (150 Elo means 70% winrate). It’s more like 50 Elo (roughly 60% winrate) = 1 rank (which corresponds to observed winrates between consecutive EGF ranks in the kyu range).

70% winrates (=150 Elo gaps) between consecutive ranks are typically observed around 5d EGF.

65% winrates (=100 Elo gaps) between consecutive ranks are typically observed around 1d EGF.

3 Likes

70% and 85% are much more arbitrary than handicap stones.

8 Likes

I used 70% / 85% pretty much as arbitrary numbers and I have no idea if it’s anywhere near what is measured in real games. Make it 55% or 90%, same difference for what I was trying to say.

My point is that Elo is something that is measurable regardless of the strength of a player. It’s possible to make a good prediction of the winning chances between players by simply subtracting their Elo score. It doesn’t depend on the actual strength of the players.

Ideally handicap stones correspond to a fixed difference in Elo. They don’t, though. That was what I was trying to say.

7 Likes

I mean could you come up with a rating system that achieved this?

I’m not sure what the requirement would be for this to work. You assign numbers to people to give them relative strengths and the difference between their numbers is some indicator of the likelihood one beats the other.

Like there should be some probability function p(R_B - R_A, x_1, …, x_n) that depends only on the difference in ratings and some extra tweakable parameters. Maybe one parameter (x_1) is the number of handicap stones (or komi, if you wanted to do it that way), and ideally p(0, 0, …) = 0.5, p(100, 1, …) = 0.5, …, p(100k, k, …) = 0.5.
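A minimal sketch of such a function in Python, assuming (purely hypothetically) that each handicap stone cancels a fixed number of rating points; elo_per_stone is an invented parameter here, not a measured value:

```python
def p(rating_diff, handicap_stones=0, elo_per_stone=100):
    """Win probability for player A, given R_B - R_A and A's handicap stones.

    Hypothetical sketch: satisfies p(0, 0) = 0.5 and p(100*k, k) = 0.5
    by letting each stone cancel elo_per_stone rating points.
    """
    adjusted = rating_diff - handicap_stones * elo_per_stone
    return 1 / (1 + 10 ** (adjusted / 400))
```

The catch, of course, is whether a real handicap stone is actually worth a constant number of rating points at every level.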

I might look into how some of these rating systems were devised if I get a chance.

4 Likes

So you consider 7d-9d amateur as overlapping with professional, as the EGF ranking system does.
@Clossius1 considers only 9d amateur as overlapping with pro level, as is the case on Asian go servers. This is already quite a discrepancy.
I’m curious which amateur ranks the AGA sees as overlapping with professional strength. And what about OGS?

We could make an international standard by anchoring everybody to KataGo by the handicap they need to score 50% against it.

We may guesstimate that KataGo needs about 2.5 stones handicap (or 21 points reverse komi) against a hypothetical perfect player. I’m using 14 points komi as the distance between ranks (the value of a full handicap stone). This matches quite well with KataGo’s estimates of komi values of handicaps.

So when we say that the perfect player is level 0 (and higher levels are weaker), KataGo may be about level 2. Yeonwoo 1p may be about level 5 (she seems to need 3.5 stones handicap to win 50% against KataGo). A world top pro may be level 4 (needing about 2.5 stones handicap = 21 points reverse komi against KataGo?). An EGF 1d may then be about level 11 (needing about 6.5 stones handicap against Yeonwoo?) and a 29k may be about level 40.

With this ranking system there would be no arbitrary boundary between kyu and dan.

But single digit levels would still carry prestige (even though the decimal system is quite arbitrarily linked to the number of fingers we have). Single digit level would probably carry even more prestige than current “dan”, because level 9 would be about mid dan EGF. So reaching level 9 would be even tougher than reaching “dan”.

Determining if you are level 9 is as easy as firing up KataGo (giving KataGo at least 10k playouts per move, probably requiring a machine with a decent graphics card) and beating it about 50% of the time with a handicap alternating between 7 and 8 handicap stones (or 91 points reverse komi).

One would still have to use some maximum handicap of 9 stones (~ 120 points reverse komi) between players, because at some point the handicap system breaks down. Even the perfect player cannot win when giving 361 reverse komi under area scoring.
So this method can only be used directly to determine levels 10-0. Beyond that, levels would just be determined by handicaps against players with an established level (just as normal).

4 Likes

I think you’d need to put a comma between R_A and R_B, like p(R_A, R_B,x_1,…,x_n), because it does not depend only on rating difference, but also on the ratings themselves.

3 Likes

funnily enough, Elo already does this
E(r1,r2)= 1/(1+10^((r2-r1)/400))
is the chance of the player with rating r1 beating the player with rating r2

Glicko follows a modified formula that incorporates Ratings Deviation, but when the opponent’s RD=0 it’s essentially the same formula.
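For concreteness, that Elo expectation is just a couple of lines of Python (the ratings here are arbitrary illustrative values):

```python
def elo_expectation(r1, r2):
    """Chance of the player rated r1 beating the player rated r2 (Elo)."""
    return 1 / (1 + 10 ** ((r2 - r1) / 400))

# An equal pairing is 50/50, and a 150-point gap is roughly 70%:
print(elo_expectation(1500, 1500))            # 0.5
print(round(elo_expectation(1650, 1500), 2))  # 0.7
```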

4 Likes

I know Elo already does this; how does it incorporate knowledge about the game itself? How do I add a parameter to it to say that games aren’t even games, that they have x handicap stones or komi? I suppose you could just define a certain set of ratings to be x kyu and y dan, and adjust a player’s rating by, say, m*100 points to account for m handicap stones (assuming rating gaps work in 100s) before inputting it into the formula.

I’m just trying to understand @Vsotvep’s point. I will probably just sit down and look at these rating systems, slightly worried I’ll get sucked into playing with the github data.

It’s not like the Asian servers are filling up with bots pushing people out of 9d rating spots.

3 Likes

In general, gaps between handicaps would not be equal to 100 Elo in OGS (or any other go rating system AFAIK).

Yes, in the EGF system, gaps between handicaps are 100 GoR points (by definition), but GoR points are not Elo points. From my analysis of the EGF data, gaps between handicaps in terms of Elo varies from about 35 Elo around 20k EGF, via about 100 Elo around 1d EGF to about 200 Elo around 8d EGF.

I can give the function that is predicted by the EGF system and a function that matches the actually observed data in the EGF tournament games (they are different), but I suppose you’re more interested in the OGS version.

2 Likes

well, there is a way to do this, but you’d probably wanna go back to the simpler version of the model: the Bradley-Terry Model

So the Bradley-Terry Model essentially says the chance of A being observed as better than B is A/(A+B); interestingly enough, this is where the Elo formula comes from.

If you replace A with 10^(r1/400) and B with 10^(r2/400), you get 10^(r1/400)/(10^(r1/400)+10^(r2/400)), and then if you divide top and bottom by 10^(r1/400) you get 1/(1+10^(r2/400)/10^(r1/400)), which equals 1/(1+10^((r2-r1)/400)).

So let’s go back to the A & B units for simplicity, remembering that A = 10^(r1/400) and B is the same for r2.

And we’ll say that A is the one getting the handicap benefit h, which we will suppose is a function of rank, h(r1). Then the proper formula for the expectation is (A+h(r1))/(A+h(r1)+B), which can be implemented by a similar function that increases the number of rating points of r1 for the purpose of calculating the expectation.
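A sketch of that last idea in Python; elo_per_stone is a made-up constant standing in for h(r1) (in reality the value of a stone varies with rating, as noted earlier in the thread):

```python
def strength(r):
    """Bradley-Terry strength: A = 10^(r/400)."""
    return 10 ** (r / 400)

def expectation_with_handicap(r1, r2, stones=0, elo_per_stone=100):
    """Chance of player 1 (who receives the handicap) beating player 2."""
    a = strength(r1 + stones * elo_per_stone)  # boost applied only for the expectation
    b = strength(r2)
    return a / (a + b)

# A 100-point underdog taking one stone is back to an even game:
print(expectation_with_handicap(1400, 1500, stones=1))  # 0.5
```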

5 Likes

Indeed, the Bradley-Terry model would be more suitable.
From https://en.wikipedia.org/wiki/Bradley–Terry_model

logit(p) = B_i - B_j

I get a good fit for EGF game data with B( r ) = -6 * ln(3200-r) where r is rating expressed as GoR (with GoR 2100 = 1d and 100 GoR gaps between ranks / handicap stones).

A conversion between Elo and B would be
Elo( r ) = B ( r ) * 400 / ln(10) + C

C is an arbitrary constant. C = 9100 seems to give a fairly good match with Elo ratings from https://www.goratings.org/en/
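Putting the fit together in Python (the constants are the quoted empirical fit to EGF data, not exact values):

```python
import math

def B(gor):
    """Fitted Bradley-Terry exponent for an EGF GoR rating (GoR 2100 = 1d)."""
    return -6 * math.log(3200 - gor)

def win_prob(gor1, gor2):
    """logit(p) = B_i - B_j, so p = 1 / (1 + exp(B_j - B_i))."""
    return 1 / (1 + math.exp(B(gor2) - B(gor1)))

def elo(gor, C=9100):
    """Convert the fitted B to an Elo-like scale."""
    return B(gor) * 400 / math.log(10) + C

# With this fit, one rank (100 GoR) around 1d is worth roughly 95 Elo:
print(round(elo(2150) - elo(2050)))  # ~95
```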

3 Likes

What’s the difference between a GoR rating and an Elo rating? I thought GoR was derived from an Elo system with some extra tweaks?

3 Likes

Yes, the most defining tweak is that 100 GoR points means one rank/handicap stone instead of 65% winrate like in Elo. So that’s a major tweak.

5 Likes

Expressing the Bradley-Terry B function in terms of those levels would be even simpler: B( n ) = -6 * ln( n * 100 ) where n is the level.
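In Python terms (assuming the level convention from earlier in the thread, where lower levels are stronger and level 0 is perfect play):

```python
import math

def B_level(n):
    """Bradley-Terry exponent for a player at level n (n > 0)."""
    return -6 * math.log(n * 100)

def win_prob(n1, n2):
    """Chance the player at level n1 beats the player at level n2."""
    return 1 / (1 + math.exp(B_level(n2) - B_level(n1)))

# One level is worth more (in winrate) near the top of the scale:
print(round(win_prob(5, 6), 2))    # one-level gap at level 5
print(round(win_prob(11, 12), 2))  # one-level gap at level 11 (~EGF 1d)
```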

3 Likes

Small mistakes like this are inevitable in any complex software. The big picture is that the OGS rating system is very good.

The impact of this minor error is limited in time: it will soon be fixed in the code, and a few weeks from now, ranks will have adjusted and the glitch will be forgotten.

11 Likes

Here’s the real question, for which I think we need @Gia’s input: does this count as part of the Cor-corr incident?

6 Likes

This is a stand-alone feature, and will be shipped to theaters as “Meltin’ Ratings”.

10 Likes