The question is still the same: we cannot even be sure of, or agree upon, human strength and human rating and ranking. Wouldn’t that make the estimation even less reliable?
Neither of these estimations is particularly accurate, that’s for sure, if for no other reason than the insufficient amount of data.
But with a closed pool of players, the distance between 3d and 4d is objective - 1 actual stone. The winrates in 1d vs 2d, or 3d vs 4d games are also objectively measurable. Once you label - even if arbitrarily - a level as “1d”, the distance from that point can be estimated based on objective data only.
Player strength isn’t something discrete that can easily be put into clear-cut brackets. People’s strength is spread across “ranks”, and the gap between dan ranks has constantly changed throughout history. It used to be two dan ranks per one stone of difference (before the pro and amateur split), then three ranks per stone, etc. So if the gap is arbitrarily chosen and ranks are more or less fluid over time, then the “objective” strength label for any match is itself actually subjective. (Compare something uncertain with something uncertain to measure another uncertain range, and the outcome comes out certain?)
The only thing we might be able to tell is some kind of larger trend, and all kinds of regression curves can be fit when we only have the beginning part of that trend. We are assuming that people or AIs are close to the end of the spectrum, so the projected perfect play appears to sit on a wall that isn’t far away. But that’s just what we believe, not necessarily what is true. And the majority of the data we have comes from SDK matches, which are definitely at the beginning of the strength spectrum.
Maybe we could analyze the predicted point difference at every move in AI vs AI matches, run them billions and billions of times, and see how accurate their estimate at every step is compared to the actual outcome (forcing the AIs not to resign and to play out the yose). If their projections turn out close enough, then we might be able to say something about the “consistency of AIs’ score estimation”.
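Something along these lines is what I have in mind, as a rough sketch only; the data layout and function names are made up, assuming we can extract per-move score estimates and the played-out final score from each game:

```python
import statistics

# Sketch of the consistency check: for each played-out AI vs AI game we have
# a list of per-move score estimates and the actual final score.
# The data layout is an assumption, not taken from any real engine output.

def estimation_errors(games):
    """games: iterable of (per_move_estimates, final_score) pairs,
    with all scores from the same side's point of view."""
    errors = []
    for estimates, final_score in games:
        for move_number, estimate in enumerate(estimates):
            errors.append((move_number, estimate - final_score))
    return errors

def mean_abs_error_by_phase(errors, bucket_size=50):
    """Mean absolute estimation error grouped into move-number buckets,
    to see whether the projections tighten as the game proceeds."""
    buckets = {}
    for move_number, err in errors:
        buckets.setdefault(move_number // bucket_size, []).append(abs(err))
    return {b * bucket_size: statistics.mean(v) for b, v in sorted(buckets.items())}
```

If the mean absolute error stayed large deep into the endgame, that would already say something about how much to trust the engines’ own score estimates.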
I meant objective in the sense that in a sound/modern rating system, when a group of 2d players plays several matches against 3d players with 1 extra handicap stone, the results will be close to 50%.
Do we really have data to support that? How do you decide that a player belongs to the “2d group”? I used to have a rank of 2d; when did I stop being part of the “2d group”? How long do players have to “keep” their ranks to belong to a certain group? If you say we should just trust the 2d label in a record, then we are circling back to how the label was placed in the first place. If today we labelled everyone on OGS as 2d and everyone on IGS as 3d, and the results of 0.5 komi games ended up close to 50/50, would that mean the “rank label” is trustworthy?
And judging from the handicap results, 0.5 komi games aren’t that many (less than 1%, I think). If we had IGS historical data, we might have more handicap games to analyze, though.
This is what I wrote above: it doesn’t matter how. An arbitrarily set “2d” label is also ok - as long as you rate matches correctly, following the 1 stone differences relative to that level. That is: the internal consistency of the system and rating pool is honored. So the same level can be 2d on one server and 5d on another server, no problem.
That’s basically the content of gennan’s post #45. The winrate of (n+1) dan vs. n dan corresponds to the Elo difference between (n+1) dan and n dan. We have data for human ranks, and we guess by curve extrapolation the rank at which the winrate of (n+1) dan vs. n dan will be 100%.
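A minimal sketch of that mechanic, with made-up winrates (not real OGS/EGF data); note that a 100% winrate corresponds to an infinite Elo gap, so a finite extrapolation needs either a cutoff (99% here) or a model in which the gap diverges:

```python
import numpy as np

# Placeholder winrates of (n+1)d vs n d in even games, only to show the
# mechanics of the extrapolation; these are NOT real EGF/OGS numbers.
ranks    = np.array([1.0, 3.0, 5.0, 7.0])        # n in "n dan"
winrates = np.array([0.62, 0.66, 0.71, 0.76])

# Under the usual logistic Elo model, w = 1 / (1 + 10 ** (-gap / 400)),
# so the implied Elo gap between adjacent ranks is:
gaps = 400 * np.log10(winrates / (1 - winrates))

# A winrate of exactly 100% means an infinite gap, so extrapolate to a
# near-certain winrate instead (99% here) with a simple linear fit.
slope, intercept = np.polyfit(ranks, gaps, 1)
target_gap = 400 * np.log10(0.99 / 0.01)
rank_at_99_percent = (target_gap - intercept) / slope
print("implied Elo gaps per rank:", gaps.round(0))
print("rank where adjacent-rank winrate reaches ~99%:", round(rank_at_99_percent, 1))
```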
Here is the whole picture.
Handicap (columns) VS opponent’s rank (rows).
Colour and shape are for outcome: opponent’s wins are blue, pointing up; opponent’s losses are orange, pointing down.
Size is for number of games.
Rank 20 is for 10k. Rank 30 is for 1d.
Up to 6 stones, katago-micro wins a lot.
From 7 stones up we start recognising a sort of boundary between wins and losses.
Here I zoomed in a little and switched to pie charts: orange where kata won, blue where kata lost.
I also removed smaller dots (up to 100 games).
I sort of recognise some boundaries.
We’re far from having a crowded points cloud with easily recognizable areas.
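For anyone who wants to play with this, here is roughly how the summary behind these charts can be rebuilt; the file name and column names are assumptions, not the actual dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: 'handicap', 'opponent_rank', 'kata_won' (0/1 or bool).
games = pd.read_csv("katago_micro_games.csv")    # hypothetical file

summary = (games.groupby(["opponent_rank", "handicap"])["kata_won"]
                .agg(n="size", kata_winrate="mean")
                .reset_index())
summary = summary[summary["n"] > 100]            # drop sparse cells

plt.scatter(summary["handicap"], summary["opponent_rank"],
            s=summary["n"] / 5,                  # marker size ~ number of games
            c=summary["kata_winrate"], cmap="coolwarm", vmin=0, vmax=1)
plt.colorbar(label="katago-micro winrate")
plt.xlabel("handicap stones")
plt.ylabel("opponent rank (20 = 10k, 30 = 1d)")
plt.show()
```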
I think so. It’s hard to tell, but the pie chart version was actually a really nice idea. I guess one hopes that the vertical-line pie charts (~50/50 split) mark the appropriate handicaps.
But yes, it’s maybe not very conclusive; it’s really fun to look at, though.
Doesn’t this come back to the idea of discarding “perfect play” and just using Katago as an upper bar to define ranks? Since this whole system would revolve around Katago’s evaluation of point losses (and not point losses compared to perfect play).
It could be.
But I think it would be necessary to have way more reliable data, that is lots of ranked games.
Given the overview that we’re taking from this analysis, we should encourage all SDK and dan players on OGS to play a bunch of ranked games against that bot (or any other bot that we want to use as a reference) in order to have appropriate Glicko2 ratings for all of them, including the bot itself, which so far has been evaluated only against other bots.
I think you are missing the point here. The best guess of any rating system (the problem of cross-platform comparability will just create more uncertainty), even the one discussed here, at best gives us rank ± 1 rank. Even if a rating system could give ± 0.5 rank in the 1d to 2d range (which I doubt it can), that is, by your own argument, 50% uncertainty per one-rank gap, while perfect play might be some 50 ranks ahead of the 1d to 2d range. This is what I meant by the “projection” of a local uncertainty toward a target when we are not even sure how far it is from us.
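As a back-of-the-envelope sketch of that compounding, under the (debatable) assumption of independent errors of about ±0.5 rank per measured one-rank gap:

```python
import math

# If each measured one-rank gap carries an independent error of ~0.5 rank,
# stacking N such gaps compounds the error roughly like sqrt(N) * 0.5.
per_gap_sigma = 0.5
for n_gaps in (5, 10, 20, 50):
    print(n_gaps, "gaps ->", round(math.sqrt(n_gaps) * per_gap_sigma, 1),
          "ranks of uncertainty at the far end")
```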
That’s what I was trying to say with this
If we can be convinced of the AI-to-AI consistency within their own gaps, then we can use it as a “measuring tape”, so to speak, to measure ours and make the extrapolation more meaningful (there is still the small problem of whether the AIs can be trusted in their own opinions, since they cannot explain to us how they arrive at their “score estimation” and we just have to believe their black-box output; these AIs could still turn out to be miles off in a future where much more powerful computers compute trillions and trillions of times faster than ours).
As I described, I’m using the EGF handicap/ranking system of 1 stone per rank, which is very similar to the handicap/ranking systems of OGS and KGS. AFAIK this system has been standard in the West for at least 50 years, so I assume @jannn also has that in mind.
For the regression of the Elo width per rank above SDK level, I think there is enough data available outside OGS.
In the (EGF) amateur dan range, there is 25 years of tournament data in the EGF rating system, with published even game win% between different ranks, from which Elo gaps can be estimated.
We also have global pro range data in the form of Elo ratings from Go Ratings, where the Elo ratings of pro players cover a range of roughly 1000 points (2700-3700). And we also know that the proper handicap between the strongest and weakest pros is probably not more than 3 stones without komi (an EGF rank gap of 2.5).
The EGF rating system anchors 7d amateur to the lower end of the pro range, so I’m matching that 1000 Elo range to the EGF rank range of about 7d-10d.
These data sources are not perfect of course, but I feel that everything together is more than just anecdotal data and hand waving. The combined picture looks consistent enough to me to estimate infinite Elo (perfect play level) occurring around 13d on the EGF rank scale (within 1 or perhaps 2 ranks).
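To make that concrete, here is one possible toy model (a simplification, not the actual regression): assume the Elo gap per rank diverges near the perfect-play rank like gap(r) = c / (r_inf - r), and pin it down with two rough data points from the sources above. The inputs are placeholders, so the output differs somewhat from the 13d estimate:

```python
# Placeholder data points, loosely based on the sources mentioned above:
#   around 1d EGF:   ~100 Elo per rank
#   around 8.5d EGF: ~333 Elo per rank (1000 Elo spread over ~3 pro ranks)
r1, g1 = 1.0, 100.0
r2, g2 = 8.5, 333.0

# g1 * (r_inf - r1) = c = g2 * (r_inf - r2)  =>  solve for r_inf
r_inf = (g2 * r2 - g1 * r1) / (g2 - g1)
print(f"estimated perfect-play rank: about {r_inf:.1f}d EGF")
```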
I am not saying the system as a whole couldn’t show a steady trend between the ranks we know. It’s as if we were all bicycle racers on a journey toward a faraway uphill mountain: most just can’t get far and stop early (mostly around SDK), and we use the “stamina and strength” of each person as an indirect measuring tape, judging each other’s distance from how two people compare when racing together, without ever “objectively” measuring the distance between them, as everyone joins a race that branches into many tracks where most simply get lost.
The problem is that no one has the stamina to climb to the top of the mountain, and we are extrapolating this “stamina” from something else (AIs) that is much more powerful and running ahead of us on hardware humans could never match, as if in a powerful car. So the issue is: since no one can measure up to the AIs in a one-on-one race, even with a massive head start, and those who actually can are few and far between, how could we use the human “estimation” system to judge the AI-to-human situation, or even AI-to-AI competition? We can ride along with these AIs, watch them race each other and report their scores, but we can hardly steer these powerful machines (and we haven’t actually gotten out of the car yet to compare the AIs’ self-reported score estimates against the marks on the ground). How do we even know the races between AIs are run on the right “tracks”, rather than taking detour routes based on the AIs’ own “self-reinforced” training? They follow their own easy paths (AI joseki, AI openings), just as we humans follow ours. Saying that the past trajectory can serve as a guide to extrapolate the final destination is like running in a maze, knowing you are somewhere near the center, and claiming you will reach it in no time.
I have some difficulty in following your analogy, but are you suggesting that it is impossible to measure the levels of AI and (hypothetical) perfect players on a human rank scale based on handicap (such as the EGF rank scale)?
I don’t see why that would be the case. Human players can and do play handicap games against AI to determine the handicap at which they can beat the AI. From experiments that I have seen and heard of, pros typically need a couple of stones handicap to beat strong AI. At about 2 stones handicap, the odds seem to favour the AI. But at 4 stones handicap, the odds seem to favour the pro. That implies that the rank gap between pros and AI is roughly 2.5 EGF ranks (3 stones).
Why dismiss this kind of empirical data offhand?
Similarly, it should be possible to use handicap to measure the rank gap between different AI and the rank gap to a perfect player (if it were available to us). Or do you think that a perfect player can beat all non-perfect players with an arbitrarily large handicap? I don’t think that’s possible (unless the perfect player is allowed to cheat, for example by mind-controlling their opponent).
Or do you consider a system of ranks based on handicap as inherently invalid or flawed? If that is the case, there is not much I can say, because we are on different race tracks altogether.
I am not just making metaphors for metaphor’s sake. Imagine each possible board position is one point in space, connected to other positions via the next “legal move”. If we connect every position, we get a gigantic web, a maze.
Some points in this web are “ending” positions where neither side can play anymore, and we can get absolute scores (marks on the ground) out of them (including some unfinished positions where the boundaries are settled; these ending points are really “ending regions” which, once entered, cannot be left, and experienced players know how to “close the door”). Now mark some points as the race’s starting entrances into this maze (certain arrangements of black stones only on a limited set of board points, i.e. the “handicaps”, including even games). The question of playing a game is then transformed into how a pair of players traverses this giant maze, taking turns navigating it to get closer to one of the “ending regions”.
In this giant landscape, if the race starts from “handicap” starting points, the white-winning regions will be far smaller than the black-winning regions, so navigating from increasingly high handicap starting points requires more “stamina/strength/calculation”, whatever you want to call it. But the assumption that we can line up and transfer the measurement of navigation skill near the handicap areas to the race from the “even start” area, with its probability of reaching nearby winning regions, is just that: an assumption. It is a way of mapping (projecting) the giant “state space” (3^361 raw states for 19x19: 361 points with 3 states each) into a much lower dimension with ratings and rankings (say, mapping the maze into a 2-dimensional landscape whose height corresponds to the rating/strength needed to reach a point).
What any rating system does is attach rating numbers to certain points in the projected maze (every game position in the records; if a position such as a starting position carries many players’ ratings, just aggregate them, e.g. average them), and then try to rearrange the landscape so that ratings rise in ascending order toward the center of the maze (or the high ground in the winning regions). This distorts the natural topography of the state space: the white-winning regions should be quite far away from the even-game area, but in the projected map they are very close, since the higher the “player’s strength”, the closer the mapped points sit to the high-rating ground (and better players have higher ratings, play the high-handicap games, and there are few of them to average over). But in reality there are so few paths connecting the halfway points of the even-game area with the handicap halfway points that the tracks might as well be completely separated. The rating gap just sets checkpoints among the positions, forces an equal unit of distance between them after each match, and updates the landscape accordingly.
Now the final question is: after updating the landscape with all possible games, how many units of distance lead from the starting entrances to the peak of the central winning region?
What I was asking (and saying) is: if we use a certain projection (rating system), are we being too optimistic about our abilities, exploring only near the outskirts and getting a projection in which the central high ground isn’t really very big at all (very tiny according to our human ranking, if it is just 13 dan)? Shouldn’t we at least use the AIs to get a feel for how rough the terrain is, by evaluating their ability to estimate the roughness of the landscape, before jumping to conclusions? Their abilities and explored areas are certainly far wider than anything humans can manage.
What do you mean by “perfect player” for White in a handicap game? Unless you are talking about resigning on move 1, the best moves in handicap depend on your opponent’s weaknesses, so there is no single strategy that is always best.
I mean a perfect player that has perfect knowledge about the game and only plays moves that it knows for certain to not lose points against another perfect player, even if it knows or notices that their opponent is not actually a perfect player.
So it would never use point losing trick plays to speculate on the imperfect knowledge of the opponent. It would only exploit opponent mistakes after they are made. In other words, a “go god”.
This is different from a “go devil”, a type of player with perfect knowledge of the game and also perfect knowledge of the opponent’s mind, so it can perfectly predict its opponent’s mistakes and exploit them even before they appear on the board.
A “go devil” cannot beat a “go god” in an even game with perfect komi, but it would be able to give (much) more handicap to fallible opponents. But a “go devil” would basically be cheating, so I don’t consider it a proper reference point to determine the handicap distance from perfect play of a fallible player.
(also see other forum topics where this distinction came up)
My definition of the “perfect player” is an agent that has already traversed every point in the game positions’ “state space” (as described in my previous reply) and is always able to find the shortest travel path (not the direct path, since there are many illegal positions in the full state space) from any point toward the closest winning end-points.
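To make that definition concrete with something actually computable (obviously not Go; a toy game instead):

```python
from functools import lru_cache

# A runnable toy version of that definition, using single-pile Nim
# (take 1-3 stones, the player taking the last stone wins) instead of Go.
# The "perfect player" here really has traversed the whole state space
# (via memoised negamax) and from any position picks a move that leads to
# a winning end-point if one exists. Go's state space is of course far too
# large for this kind of exhaustive traversal, which is exactly the problem.

@lru_cache(maxsize=None)
def value(pile):
    """+1 if the player to move wins with perfect play, -1 otherwise."""
    if pile == 0:
        return -1                      # no stones left: the player to move has lost
    return max(-value(pile - take) for take in (1, 2, 3) if take <= pile)

def perfect_move(pile):
    """A move that preserves the game-theoretic value of the position."""
    return max((take for take in (1, 2, 3) if take <= pile),
               key=lambda take: -value(pile - take))

print(value(10), perfect_move(10))     # pile of 10: the mover wins, e.g. by taking 2
```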
Then (by @gennan’s definition) there will be a whole range of “perfect players” with different choices among points-equivalent moves. Some will do better against bots and humans, for example by choosing the more complex moves.
I would think that a bot around the 5-10k range would be ideal as an anchor, since it could play handicap games against a majority of players and have a big influence on the rating system. But it should be well-defined enough that when we run it 10 or 20 years from now it’s still guaranteed to be the same strength.
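As a toy sketch of what anchoring could look like (plain Elo with a made-up fixed anchor rating; OGS itself uses Glicko-2, so this is only to illustrate the idea):

```python
# One well-defined bot keeps a fixed rating forever and is never updated,
# while human ratings move from games against it.
ANCHOR_RATING = 1150.0   # hypothetical fixed rating for a mid-SDK anchor bot
K = 20

def expected_score(r_player, r_opponent):
    return 1 / (1 + 10 ** ((r_opponent - r_player) / 400))

def update_vs_anchor(player_rating, player_won):
    score = 1.0 if player_won else 0.0
    return player_rating + K * (score - expected_score(player_rating, ANCHOR_RATING))

# Example: a ~1100-rated player beats the anchor and gains a few points.
print(round(update_vs_anchor(1100.0, True), 1))
```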