What do you think Shin Jinseo and Katago's ranks would be on OGS?

Goratings says his Elo is 3800, but the highest-rated bot on here has 3500, which is 14d, and that can’t be right since bots are better than humans. AlphaGo Zero is said to be 5,000 according to Wikipedia. I tried to calculate it, and it starts getting absurd: 20d and the like, and that can’t be right because even the weakest pros can’t be more than 5 ranks below AI, and on here they would be about 7 dan I imagine. So I think Shin might be closer to about 3200 on here, since there is a human user here with that rating. I also wonder where AI would plateau if it were allowed to on here; lots of 9d+ bots stop playing ranked games before reaching a sure peak. I also wonder how bad someone can be. I’ve seen AI estimated to have something like -1000 Elo, but on here it doesn’t go under 100.

They’re all different rating scales, so they’re more or less incomparable unless a lot of users play games and hold ratings on each of the scales.

Bots that have their own rating scales most likely aren’t anchored to any human rating data, but rather are just comparing versions of the bots against earlier versions.

The main reason a person or AI account will plateau has more to do with the opponents it’s facing than anything else. If one AI were the best and had a 100% winrate, it would plateau mostly because it has run out of challenging opponents, and the rating gain dwindles after a while when the result is very certain.

You kind of need active pools of users or bots at all levels near the strong human or AI to continuously funnel rating gains upward, or it will just stall out for that reason.

Likely they can gain less than 1 rating point for a win, but potentially lose 20-30 rating points for a loss or malfunction, so it stops being worthwhile to aim for a higher rating without opponents of equal or similar strength to gain rating from.
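As a rough illustration of that asymmetry (a plain-Elo sketch with an assumed K-factor of 32; OGS actually uses Glicko-2, so the real numbers differ, but the shape of the problem is the same):

```python
def expected_score(r_player, r_opponent):
    """Expected score of the player under the plain Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400))

def rating_change(r_player, r_opponent, score, k=32):
    """Rating change after one game (score 1 = win, 0 = loss), K assumed to be 32."""
    return k * (score - expected_score(r_player, r_opponent))

# A hypothetical 3400-rated bot facing a 2200-rated opponent:
print(rating_change(3400, 2200, 1))  # ~ +0.03 points for a win
print(rating_change(3400, 2200, 0))  # ~ -32 points for a loss
```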

You can shift the whole system down arbitrarily to make the lowest user have the lowest rating that you want.

In terms of the biggest skill gap, the best bet I would think would be to make a sequence of random bots which play according to different distributions. One can be uniformly random, one can prefer to play along the edge first and in the centre later, and one could prefer certain kinds of moves at certain stages of the game.
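Something like this toy move picker, for example (just a sketch of the idea; the board representation, legality checks and GTP plumbing are left out, and all the names and numbers are made up):

```python
import random

def move_weight(x, y, move_number, size=19):
    """Toy preference: favour the edges early in the game, the centre later."""
    dist_to_edge = min(x, y, size - 1 - x, size - 1 - y)
    if move_number < 50:
        return 1.0 / (1.0 + dist_to_edge)  # edge moves look attractive early
    return 1.0 + dist_to_edge              # centre moves look attractive later

def pick_move(legal_moves, move_number):
    """Sample a move from the (unnormalised) weights above."""
    weights = [move_weight(x, y, move_number) for (x, y) in legal_moves]
    return random.choices(legal_moves, weights=weights, k=1)[0]
```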

I think it’s difficult to come up with the ultimate worst play while still being able to finish the game, since two really bad bots could kill all their living groups on the board and reset the board over and over, likely hitting some cap on game length rather than actually ending a game properly.

5 Likes

The rating scale of goratings.org is not necessarily anchored the same as the OGS rating scale, so the absolute ratings may not be comparable. Only rating differences may be actually comparable.

4 Likes

Specifically, it uses whole-history rating (WHR), which is more accurate, but also more computationally expensive.

So, unlike normal Elo, the points don’t just come from the player you are playing. It assigns each player the Elo that best predicts their odds of winning against any other player, and it recalculates everyone’s entire rating history after every game, even games they didn’t participate in.

So, it is much more accurate for the top end when normal Elo would peter out due to lack of competition.

If you had a lot of results for katago vs the pro players it already rates, you could add it as a player and estimate katago’s elo that way. It’d be better if you could do it for multiple bots of slightly different strengths.

But, realistically, computers are so much better that you could just use the raw formula and stipulate that a pro player only wins say 1% of games, and then solve for the Elo that gives those odds.
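For example, with the standard Elo formula and a stipulated 1% win rate for the pro (the 1% is just an assumption to show the arithmetic):

```python
import math

def elo_gap(win_prob):
    """Elo gap D such that the stronger side wins with probability win_prob,
    using the standard logistic model p = 1 / (1 + 10 ** (-D / 400))."""
    return 400 * math.log10(win_prob / (1 - win_prob))

# If a pro only wins 1% of even games against the bot,
# the bot is roughly 800 Elo above the pro:
print(elo_gap(0.99))  # ~798
```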

2 Likes

This is forbidden by the superko rules.

1 Like

The OGS rating system was designed so that 1d on OGS is somewhere between AGA 1d and EGF 1d. European pros are about EGF 7d, I assume they’d be 8d on OGS but that’s hard to check because high dan players rarely play rated games here.
Assuming Shin Jinseo is 3 stones stronger than European pros, he would be 11d on OGS.
This thread
https://lifein19x19.com/viewtopic.php?f=12&t=19423
is about a Japanese pro, Nakane Naoyuki, who played handicap games against Katago. He seems about 4 stones weaker than Katago. Since Shin Jinseo’s gorating is 3863, Nakane is 3104 and European pros about 2900, perhaps Katago would be about 13d on OGS.

3 Likes

I don’t know how you calculated ranks from those ratings. Did you use the OGS rating-to-rank conversion formula?

When I equate the ratings from OGS, DeepMind and goratings.org (their ratings do seem to align fairly well where they overlap), and I use the rating-to-rank conversion formula that the EGF rating system uses under the hood [*] (which I think is more realistic than the OGS conversion formula in the high dan range), I get:

  • rating < 0 = weaker than about 45k (I’d say a 5 year old novice)
  • rating 2100 = about 3d EGF (and about 3d on OGS)
  • rating 2700 = weak pros, about 7d EGF (and about 9d on OGS)
  • rating 3100 = average pros, about 8.5d EGF
  • rating 3600 = top pros, about 10d EGF
  • rating 3800 = Shin JinSeo and AlphaGo Lee, about 10.5d EGF
  • rating 5200 = AlphaGo Zero, about 12d EGF
  • rating ∞ = perfect play at 13d EGF (by definition since 2021)

Strong bots on OGS with ratings in the range 2700-3500 would have ranks in the range of 7-10d EGF, basically overlapping with regular pros. This is significantly weaker than AlphaGo Zero, but these bots have much lower #visits than AlphaGo Zero.

With all of that, I’d estimate that Shin JinSeo would need 2 stones handicap against AlphaGo Zero, but an average pro would need 4 stones handicap.


[*] GoR = 3300 - exp(((10500 - Elo) / 7) * ln(10) / 400)
See also About Revised European Ratings
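A quick spot check of the list above with that formula (my own sketch; the GoR-to-rank step assumes the usual 100 GoR per rank with 2100 GoR = 1d EGF):

```python
import math

def elo_to_gor(elo):
    """The footnote formula: GoR = 3300 - exp(((10500 - Elo) / 7) * ln(10) / 400)."""
    return 3300 - math.exp(((10500 - elo) / 7) * math.log(10) / 400)

def gor_to_egf_dan(gor):
    """Assumes 100 GoR per rank with 2100 GoR = 1d EGF."""
    return (gor - 2000) / 100

for elo in (2100, 2700, 3100, 3600, 3800, 5200):
    gor = elo_to_gor(elo)
    print(f"Elo {elo}: GoR {gor:.0f}, about {gor_to_egf_dan(gor):.1f}d EGF")
# This reproduces the list above: 2100 comes out around 3d, 2700 around 7d,
# 3100 around 8.6d, 3600 around 10d, 3800 around 10.5d and 5200 around 12d.
```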

1 Like

Eventually yes, but if both players make an effort, it can take many, many moves before a cycle is unavoidable.

@john.tromp gave an upper bound of 4110473354993164457447863592014545992782310277120 moves for the longest possible game on 19x19 with positional superko (the most strict version of superko).

Also see https://tromp.github.io/go/gostate.pdf

Edit: Note that you’d need a phenomenally huge memory to correctly detect all superko violations in such pathological games. The number of atoms in the visible universe is probably way too small for that. Besides that, the universe will probably have reached heat death (possibly rendering the passage of time meaningless) long before such pathological games would hit a position where all moves by both players are banned by the superko rule, forcing the game to stop and get scored.

3 Likes

Ok, literally returning to the exact empty board might be forbidden, but there are 2*361 one-stone boards, ~2*361*360 two-stone boards, and so on. Superko doesn’t care about symmetry.

There are probably enough of those that the person running the Go server would shut it down before a pair of really intentionally bad bots finished a game by cycling through most of the legal go positions out there.

Edit: Ah I see @gennan already said something to that effect

2 Likes

“European pros are about EGF 7d, I assume they’d be 8d on OGS”
I also came up with the same estimate. This link says an OGS 7d (2487) is about an EGF 5.3d (2530).
https://idex.github.io/go-rank-survey/go-survey-results
Lukas Podpera is EGF 7d (2700), so just adding 170 to 2487, he’d be about 8.5d on OGS. He’s been 7d for a while, but unfortunately he just keeps losing in the pro qualification tournament a bunch of times in a row. Goratings puts him at 2800. Since Shin Jinseo is 1000 points higher, that means his OGS rating would be 3800, which would be 15d. There can’t be that much of a handicap difference between them, can there?
https://www.goratings.org/en/players/1524.html

I kind of want to make a bot plateau against humans on OGS, then make it repeatedly play a stronger bot to see how far that one gets.

I’m just really interested in how many theoretical ranks can exist in this game.

That seems to be the survey from 2018. I think the 2021 rating system update of OGS made that survey obsolete (at least in regard to OGS ranks). There was also a rating system update of the EGF in 2021, though with a smaller impact.

Also, EGF ratings are not Elo ratings (like OGS ratings). The EGF rating unit is called GoR. Elo and GoR mean different things. A 100 GoR gap means a skill gap of 1 handicap stone, while a 100 Elo gap means a 63% win rate in an even game. These different measures don’t really coincide, except around 1d level.

At the high end, I don’t think a perfect player can give AlphaGo Zero more than a few stones handicap. But at the low end, I don’t think there is a limit. A player or bot that always just resigns right after the game starts will eventually see its rating dropping towards negative infinity (in theory).

1 Like

OGS uses the formula rank = ln(rating / 525) * 23.15 where a rank between 30 and 31 is 1d.
For “normal” users of OGS, a rank difference of n corresponds to a strength difference of n stones. But Shin Jinseo may be above the range in which such a conversion is reliable. I.e. the formula ln(rating / 525) * 23.15 puts him 6 or 7 ranks higher than Lukas Podpera but he is not 6 or 7 stones stronger.
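For reference, that conversion and its inverse as a small sketch (the kyu/dan labelling is my assumption that rank 30.x corresponds to 1d, 29.x to 1k, and so on):

```python
import math

def rating_to_rank(rating):
    """OGS conversion as stated above: rank = ln(rating / 525) * 23.15."""
    return math.log(rating / 525) * 23.15

def rank_to_rating(rank):
    """Inverse of the above."""
    return 525 * math.exp(rank / 23.15)

print(rating_to_rank(2487))  # ~36.0, i.e. about 7d if 30.x = 1d
print(rank_to_rating(30.0))  # ~1919, roughly the 1d boundary
```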

3 Likes

I thought there was a wide enough rank distribution that the current formula would be accurate. When it’s graphed, it looks smooth and good. I guess we’ll never know until we get more high-level players, like chess servers have, which is why I’m happy to support it if that helps.

That is pure conjecture, since we do not have anything like a perfect player on non-trivial board sizes. It is likely to be wrong if you take into account starting from any given initial position, because one can encode QSAT problems as go problems.

Though I can’t give a mathematical proof, I do think there is sufficient empirical (statistical) evidence to support my conjectures.

Would you agree that go ranks (i.e. based on handicap game results) have a finite upper limit? I’m quite sure there will be a limit to how many handicap stones a perfect player can give to (say) a top bot. Or do you think it’s possible for a perfect player (i.e. a player that never loses points) to give 9 stones handicap to a strong contender that typically only loses a few points per game?

On the other hand, I think Elo ratings (i.e. based on even game results with perfect komi, possibly 6 under Japanese rules and 7 under Chinese rules) can go arbitrarily high.
When only playing even games with perfect komi, a strong contender that loses a few points per game will lose all the time against a perfect player that never loses points. By chance there may be an occasional jigo (the strong contender happens to also make no mistakes in some games), so in practice the perfect player may not actually reach an infinite Elo rating, but their rating can still become arbitrarily large as they continue playing only even games.

Also, from a dataset of game results annotated with the players’ ranks (which are based on handicap game results that may or may not be in the dataset), one can use logistic regression to determine the Elo gap per rank over the skill range of the players in the dataset. I did that in 2019 for the games in the EGF tournament games database (about 1 million games, with arguably a higher-quality representation of high dan games than OGS data has):

For ~10k players the Elo gap per rank is about 50 points. For 1d players the Elo gap per rank is about 100 points. For pros the Elo gap per rank is about 300 points. Based on the same dataset (I assume), a (French) American mathematician came to similar conclusions in 2016, including the apparent perfect play asymptote at ~13d EGF (see Elo Win Probability Calculator in section Elo per Stone).
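The kind of fit I mean looks roughly like this (a minimal sketch, not the actual script I used; it assumes a table of even-game results annotated with the rank gap between the players):

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one row per even game, rank_gap = rank(A) - rank(B),
# result = 1 if player A won. In practice this would come from the EGF database.
rank_gap = np.array([[1.0], [2.0], [-1.0], [0.0], [3.0], [-2.0], [1.0], [-1.0]])
result = np.array([1, 1, 0, 1, 1, 0, 0, 1])

# Fit P(A wins) = 1 / (1 + exp(-(b * rank_gap + c))); large C to keep
# regularisation from shrinking the coefficient.
model = LogisticRegression(C=1e6).fit(rank_gap, result)
b = model.coef_[0][0]

# In the Elo model P(win) = 1 / (1 + 10 ** (-D / 400)), so a logistic
# coefficient b per rank corresponds to b * 400 / ln(10) Elo per rank.
print(b * 400 / math.log(10))

# To see how the Elo gap per rank changes with level (about 50 at ~10k,
# about 300 for pros), repeat the fit separately per skill band.
```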

Also, the Elo gap between AlphaGo Zero and AlphaGo Lee is already about 1500 points (according to DeepMind), while I think it’s safe to assume that the required handicap between them is probably not more than 2 stones (or do you think AlphaGo Zero could give AlphaGo Lee a much higher handicap and win 50% of the games?). This would mean that the rank gap between them is not more than about 1.5 ranks and the Elo gap per rank is at least about 1000 points around the level of those strong bots.

Based on all of that, I don’t think it’s too speculative to assume that the Elo gap per rank tends to infinity in the limit of perfect play, while the handicaps required (to even up winning chances) between progressively stronger players become smaller and smaller, tending to 0 in the limit of perfect play (note that handicaps can be made fractional by using komi handicap and switching between different komi values with some chosen frequency).

I don’t understand what exactly you mean by that. Would you expect a perfect player (which never loses points) to beat a top bot from any initial board position, no matter how unfavourable?

Exactly. There seems to be a wide consensus that strong pros can’t give more than about 3 stones handicap to weak pros (though Shin JinSeo may be exceptional in that he may be able to give 4 stones handicap to weak pros; expecting him to give Lukas Podpera 6 or 7 stones handicap seems too much).
So I think OGS’ rank conversion formula underestimates the Elo gap per rank in the high dan range. Or conversely, it overestimates rank gaps (required handicaps) between players with very high Elo ratings.

Also see 2021 Rating and rank adjustments - #59 by gennan (and there are several other discussions in the forums where this topic came up).

The alternative graphs (see links above) also look smooth and good, and they are based on (probably) higher quality data for high dan ratings.

Maybe a dumb question, but why can’t the weakest pro be more than 5 ranks below AI?

What is absurd now can be normal tomorrow (and completely outdated next week).

I can buy that a perfect player (one that has an exhaustive tablebase) would always win against the second-best player (though even this isn’t quite true: either they’re playing with the perfect and fair komi of 7.0 under area scoring, in which case an almost-as-strong second player may hold the draw in some games, or they’re playing 6.5 or 7.5 komi, in which case the perfect player starts out on the theoretically losing side in half their games, which the second player, if almost as strong, may sometimes be able to hold). But that just gets one player asymptotically approaching some number of Elo above number 2. The rest of the field won’t have such one-sided records, and thus their ratings will behave normally.

Consider a hypothetical player who

  • Plays the opening perfectly;
  • Plays 70 middle game moves, each with a 10% probability of making a 2-point mistake and a 90% probability of finding the best move;
  • Plays the remaining endgame moves perfectly.
Such a player would lose on average 14 points compared to perfect play, so would be 1 full stone (= 1 rank) weaker than perfect play.
That player would play perfectly throughout the game with probability 0.9^70, which is about 6×10^-4.
On the other hand, 10^(-1300/400) is also about 6×10^-4.
So it’s possible that a player who is 1 full stone weaker than perfect play is “only” 1300 Elo below the maximum possible rating. But of course my model above is too simplistic, so that estimate of 1300 Elo shouldn’t be taken seriously.
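Just to check the numbers in that model:

```python
# 70 middle-game moves, each a 2-point mistake with probability 10%:
print(70 * 0.10 * 2)        # 14 points lost per game on average

# Probability of getting through all 70 moves without a single mistake:
print(0.9 ** 70)            # ~6.3e-4

# The quantity 10^(-1300/400) used in the comparison above:
print(10 ** (-1300 / 400))  # ~5.6e-4, also about 6e-4
```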
2 Likes

Let’s not forget the fact that both “ranks” and “Elo” are approximations that humans came up with, rather than being a thing that objectively exists as a property of a player.

The very concept of assigning ratings to different bots or players presupposes that it is possible to do so in a way that then for each possible pair of players, the ratings difference approximately predicts the chance of winning. This assumption tends to work surprisingly well so long as (1) all players involved are “natural” players (not artificial contrived strategies that will throw games or do anomalous things in specific cases, nor adversarially-optimized strategies designed to defeat a specific opponent), AND (2) the players involved are not too close to perfect.

Very near optimal play, it becomes a worse approximation. Mistakes become isolated and few, and don’t “average out” in a law-of-large-numbers sort of way. Whether you happen to draw, win, or lose a lot may come down to whether the opponent just so happens to prefer a particular set of openings that a bot plays well, rather than other objectively perfect openings that lead to a shape where the bot has a single blind spot. A “stronger” opponent might just so happen to prefer a different opening and do worse by only drawing, etc.

This has been the case with 9x9 CGOS (computer go server), where among the bots near the very top, the ratings system works a lot worse. Even if you have a lot of games and converge, the rating you converge to depends on what other bots happen to be running a lot at the same time as you are, and there are inconsistencies in the all-time ratings as a result.

For now, I think 19x19 CGOS is still pretty good and doesn’t yet seem to have the same issues, but if you’re talking about hypothetical perfect play even on 19x19, you’re probably into similarly sketchy territory. From 9x9 we’ve seen that ratings don’t become entirely meaningless, there’s still some decent amount of correlation, but the assumptions that the existence of a rating itself is founded on don’t hold up that well either, so ratings work poorly and can’t be defined all that precisely.

3 Likes

I don’t expect a perfect player to actually reach some astronomically large rating in practice. I do expect it to eventually reach a very high rating (compared to current top bot ratings). However high that rating will be, I still don’t expect such a perfect player to be able to give 9 stones handicap to a top pro, so it should be ranked lower than 20d EGF (I expect more like 13d, able to give 3 stones handicap to Shin JinSeo).

Of course the Elo rating system itself is not perfect, and neither is the rank system based on handicap. Both systems assume that skill gaps between players are perfectly transitive and that these skill gaps can be projected onto a single dimension without loss of information.
In reality those assumptions only hold to some degree. But I still think it is a useful approximation that works well enough under the circumstances it is normally used for: a large population of players with a somewhat homogeneous skill distribution, where everyone plays frequently against a wide variety of opponents and both players have a non-negligible chance of winning.

I think of it as if the whole thing can more or less be seen as a collection of atoms (players) that together form some sort of coherent macroscopic fluid body with a temperature (rating) gradient. In such an image, the usual simplifications are no longer valid when the model breaks down, like when you have only isolated clusters of atoms or other anomalous inhomogeneities.