It should probably be inconsistency, and it is not easy to quantify. One measure could be how a player performs vs 1 stone weaker opponents (the lower his inconsistency, the higher his winrate). Or ask a strong bot.
I don’t think you can use consistency as a synonym for strength, though.
Even assuming they are indeed statistically correlated in a meaningful way, knowing the consistency of a given player will not be enough to deduce their strength. You can make a guess based on the general trends you observe (people with this level of consistency tend to be around this level), but an individual will always deviate more or less significantly from the statistical norm.
Also, since again the notion of perfect play is purely theoretical, I don’t see how you can measure the gap in strength between, say, Katago and perfection. You could measure a gap in consistency but how much stronger in stones would you need to be to bridge this gap?
In my view a ranking built on Go consistency would be just that, but cannot be used as equivalent with a ranking built on Go strength.
One question that is interesting to ask: does Katago with 2 handicap stones sometimes lose against itself? If the answer is yes, then Katago is more than 2 stones away from perfect play.
Basically, what is the win rate as a function of player strength at a specific handicap when playing against, for example, katago-micro?
I guess one would need to filter out some games with premature resignations or maybe filter for ranked games.
So we could deduce what handicap a player with rank x needs for an even game, and see whether it is indeed (9 - x).
You could define consistency at different levels of play by the Elo width of ranks (where ranks are assumed to be determined by handicap, as in n ranks difference can be compensated for by black getting to play n moves before the game starts with black’s turn and white getting komi, up to about n = 15).
Determined from EGF historical winning statistics, ranks around 15k are about 50 Elo wide, ranks around 1d EGF are about 100 Elo wide and ranks around 7d EGF are about 250 Elo wide. Going into the pro range, you get ranks (not pro ranks, but handicap ranks as defined above) of about 300+ Elo wide.
At perfect play, the Elo width per rank would approach infinity (or perhaps a finite, but very large value, due to the fact that score is an integer value instead of a continuous value, so komi handicap increments of less than 0.5 points are meaningless).
So you can fit an asymptotic curve through those Elo widths derived from winning statistics at different levels of play, to get an estimate for the highest rank possible = perfect play.
Using this method on EGF historical data, I arrived at an estimate of 13d EGF for perfect play, from the blue curve in this Elo width per rank graph, which has been used by the EGF rating system since April 4th 2021 (the red curve is what OGS uses):
The vertical axis is the Elo width per rank.
The horizontal axis is the EGF rank expressed on the internal OGS rank scale:
- 0 = 30k
- 10 = 20k
- 20 = 10k
- 30 = 1d
- 39 = 10d = more or less highest level achieved by humans like Go Seigen and Shin Jinseo,
- 42 = 13d = max rank in EGF rating system ~ perfect play?
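To make the extrapolation idea concrete, here is a minimal sketch that fits a curve through the three Elo widths quoted above (15k, 1d, 7d) and reads off where the width would blow up. The hyperbolic form (width proportional to 1 / (r_max - rank)) is purely an assumption for illustration; it is not the formula the EGF rating system uses, and the function names are made up.

```python
def ogs_rank_label(rank):
    """Convert the scale above (0 = 30k, 30 = 1d) to a kyu/dan label."""
    r = round(rank)
    return f"{30 - r}k" if r < 30 else f"{r - 29}d"


def fit_max_rank(points):
    """Least-squares fit of 1/width = a + b*rank.

    Under the assumed hyperbolic model, 1/width falls linearly with rank
    and reaches zero (infinite Elo width) at rank = -a / b.
    """
    xs = [rank for rank, _ in points]
    ys = [1.0 / width for _, width in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return -a / b


# (rank on the 0..42 scale, approximate Elo width per rank) as quoted above
quoted_widths = [(15, 50), (30, 100), (36, 250)]

r_max = fit_max_rank(quoted_widths)
print(round(r_max, 1), ogs_rank_label(r_max))
```

With only three rough data points the result lands near 42 on this scale, i.e. around 13d, but that agreement depends heavily on the assumed curve shape and should not be read as independent confirmation.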
I looked for katago-micro in the 27M games dump.
On 19x19, we have only 86132 finished games.
51 of them are ranked and they are all against other bots.
|60b Katago 1 playout||3|
Katago-micro’s rank is mostly 38 (9d): 59038 games out of 86132.
Out of 86081 unranked games, 21709 are even games.
Katago-micro won 20522 of them (about 94.5%).
Strongest opponents are:
- hjkl123, won 207 games out of 227 (91.6 %)
- kkxxcake, won 86 games out of 409 (21%)
Out of 64372 handicap games, here is the distribution of handicaps:
Here is the breakdown of opponents’ ranks, which span from -8 (25k) to 44 (9d+)
I’m starting to feel disoriented.
What’s the next step?
Are there any games (ranked or unranked) of it against known professional players? (like players with green names). If there are, maybe they can be used as anchors.
And if there are not enough, maybe expand to the players who play with these pros the most, use them as secondary anchors, and find how many games they also played against katago-micro.
I was thinking of some kind of scatter plot of handicap given vs rank (of katago’s opponents) in each of the games, maybe colouring wins one colour and losses another (red and blue? or maybe #d55e00 and #0072b2, which I think are the colourblind/accessibility theme colours from the main site). I would hope to see some kind of approximately straight line, or more likely a curve, through the data which divides wins from losses (except for some abnormal data points), which one might be able to imagine is an appropriate relation between rank and handicap received by katago.
That is some kind of curve f where f(rank)=handicap. There’d be many such curves (if one exists) with discrete data points, but it doesn’t really matter how it interpolates between the non integer values.
That’s an intuitive guess which could be way off, but still might be fun to visualise.
Super interesting, thanks.
> more or less highest level achieved by humans like Go Seigen and Shin Jinseo,
I find it strange to present Go Seigen and Shin Jinseo at the same level. I’d be surprised if a time-travelling Go Seigen had a chance against Shin Jinseo.
Thanks. Similar to what shinuito proposed.
For each (relevant) handicap produce a plot with y-axis= winrate, x-axis= rank.
So each y value would be the average win rate over all games of players with a specific rank.
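A rough sketch of that aggregation, assuming each game has already been reduced to the opponent's rank, the handicap, and who won. The record layout and field names (`rank`, `handicap`, `opponent_won`) are hypothetical and do not reflect the actual format of the games dump.

```python
from collections import defaultdict


def winrate_by_rank(games, handicap):
    """Opponent win rate against the bot, per opponent rank, at one handicap."""
    tally = defaultdict(lambda: [0, 0])  # rank -> [opponent wins, games played]
    for game in games:
        if game["handicap"] != handicap:
            continue
        counts = tally[game["rank"]]
        counts[0] += 1 if game["opponent_won"] else 0
        counts[1] += 1
    return {rank: wins / total for rank, (wins, total) in sorted(tally.items())}


# Made-up records, only to show the expected input shape
sample_games = [
    {"rank": 25, "handicap": 9, "opponent_won": False},
    {"rank": 25, "handicap": 9, "opponent_won": True},
    {"rank": 30, "handicap": 9, "opponent_won": True},
    {"rank": 30, "handicap": 2, "opponent_won": False},
]

nine_stone_curve = winrate_by_rank(sample_games, handicap=9)
print(nine_stone_curve)
```

The returned dictionary gives the x (rank) and y (win rate) values for one plot; calling it once per handicap gives one curve per plot. Ranks with very few games will be noisy, so a minimum game count per rank is probably worth adding before plotting.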
Maybe, but Go Seigen was ahead of his time. Michael Redmond mentions several times in his AlphaGo review videos that AlphaGo’s playing style reminds him of Go Seigen. Go Seigen was a go genius and an important contributor to the new fuseki development in the 30s, adding 4-4 as a viable opening move in even games. So I think he would be quick to catch on to the hypermodern opening.
So I guess if Go Seigen at his peak would be time travelling to play against Shin Jinseo, Shin Jinseo would (initially) have some advantage in the opening (because of his extensive study with AI), but I wouldn’t put my money on Shin Jinseo if they played a jubango with 9 hours per person per game and Go Seigen getting a minimal handicap like sen-ai-sen.
First, this of course doesn’t work for an individual player, only on the statistical average of several players (and at several points on the strength scale). Then, once you have a numerical inconsistency measure (on the stone scale), its percentage change between neighbouring ranks should give a distance estimate.
So if inconsistency changes by, say, 10% between 3d and 4d, this means roughly 10 stones between 3d and perfect play (where it reaches 0).
You could express inconsistency in terms of average point loss per move.
If we take the value of a handicap stone as about 13 points (2 x komi), and consider that in a typical game both players play about 130 moves (when played until scoring), one rank difference means a difference of about 0.1 points in the expected average point loss per move.
So if perfect play is about 13d (EGF), the expected average point loss per move would be
- 0.0 points for 13d
- 0.5 points for 8d
- 1.0 point for 3d
- 1.5 points for 3k
- 2.0 points for 8k
- 2.5 points for 13k
- 3.0 points for 18k
- 3.5 points for 23k
- 4.0 points for 28k
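The list above follows from simple arithmetic: 13 points per stone spread over 130 moves is 0.1 points per move per rank, counted down from the assumed perfect-play rank. A small sketch, using the 0 = 30k, 30 = 1d scale from earlier in the thread; the constant and function names are made up for illustration, and 13d as perfect play is itself only an estimate.

```python
STONE_VALUE = 13.0      # points per handicap stone (about 2 x komi), as assumed above
MOVES_PER_PLAYER = 130  # moves each player makes in a game played until scoring
PERFECT_RANK = 42       # 13d on the 0 = 30k, 30 = 1d scale (an estimate, not a fact)

# One rank of difference costs this many points per move on average
LOSS_PER_RANK = STONE_VALUE / MOVES_PER_PLAYER


def expected_loss_per_move(rank):
    """Expected average point loss per move for a rank on the 0..42 scale."""
    return (PERFECT_RANK - rank) * LOSS_PER_RANK


# 28k, 18k, 8k, 3k, 3d, 8d, 13d on the internal scale
for rank in (2, 12, 22, 27, 32, 37, 42):
    print(rank, round(expected_loss_per_move(rank), 1))
```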
How do you measure average point loss per move though?
What does percentage change between neighboring ranks mean? What’s 10% between 3d and 4d, what’s the numerator/denominator here? And why does 10% mean 10 stones to perfect play?
To some approximation, you could use a strong AI to determine that, such as KataGo.
But you’d have to account for the errors in the AI evaluation. I would expect that the average evaluation error (per move) of a reviewer is more or less the same as the average point loss per move (distance from perfect play) the reviewer has when they play themselves, which for AI depends on the number of playouts/move.
My guess is that KataGo’s average point loss per move and evaluation error per move would be somewhere between 0.1 point at 10^8 = 100 million playouts (~12d level?) and 1.5 points at 10^0 = 1 playout (~3d level?).
The level IV OGS review by KataGo is in between at 10^4 = 10 thousand playouts. If it is about 8d level, it may have an average evaluation error of about 0.5 points.
I’m not a statistician, but I think it should be possible to calculate an estimation of the level of a player from KataGo game reviews, with error bars depending on the number of moves evaluated and the average evaluation error of KataGo (depending on the number of playouts used for the reviewing process).
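As a very rough sketch of what such a calculation could look like: take the per-move point losses reported by a review, average them, and attach a standard error that shrinks with the number of moves evaluated. The function name is hypothetical, the 0.1 points per move per rank is the figure suggested earlier in the thread, and this only covers random noise. A systematic bias in the reviewer's evaluations would not average out and is not modelled here.

```python
import math
import statistics


def estimate_ranks_below_perfect(point_losses, loss_per_rank=0.1):
    """Return (estimate, standard error), both in ranks below perfect play.

    point_losses: per-move point losses as reported by the reviewing AI.
    loss_per_rank: assumed point loss per move corresponding to one rank.
    """
    mean_loss = statistics.mean(point_losses)
    # Standard error of the mean: more evaluated moves give tighter error bars
    std_err = statistics.stdev(point_losses) / math.sqrt(len(point_losses))
    return mean_loss / loss_per_rank, std_err / loss_per_rank


# Made-up losses for four moves, just to show the call
estimate, error = estimate_ranks_below_perfect([1.0, 2.0, 0.0, 1.0])
print(estimate, error)
```

With a real game of 100+ moves per player the error bar from sampling alone becomes much narrower than in this toy call, which is why pooling several reviewed games per player would likely be needed for a usable estimate.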
Fortunately it doesn’t matter what inconsistency measure you use, as long as it’s reasonable and pointwise linear. You can use something derived from average point loss as @gennan suggested, or some kind of root of variance, whatever.
Then if 3d inconsistency turns out to be 10 potatoes, and 4d inconsistency is 9 potatoes, this is a 10% change and 9 more steps before zero.
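The potato arithmetic is a straight-line extrapolation to zero: the drop between two neighbouring ranks is the step size, and the remaining inconsistency divided by that step is the number of ranks left. A minimal sketch, with a made-up function name and the linearity taken as given:

```python
def ranks_to_perfect(inconsistency_lower, inconsistency_upper):
    """Ranks remaining above the upper rank, assuming a straight line to zero.

    inconsistency_lower: measure at the weaker rank (e.g. 3d)
    inconsistency_upper: measure at the next rank up (e.g. 4d)
    """
    step = inconsistency_lower - inconsistency_upper
    if step <= 0:
        raise ValueError("inconsistency must decrease as rank increases")
    return inconsistency_upper / step


# 10 potatoes at 3d, 9 potatoes at 4d
steps_from_4d = ranks_to_perfect(10, 9)
print(steps_from_4d)      # ranks from 4d to zero inconsistency
print(steps_from_4d + 1)  # ranks from 3d
```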
@jannn @gennan you both assume that current AI’s point estimates are accurate enough to evaluate humans as well as the AIs themselves, but I’ve seen enough suggestions and estimates from different AIs like Go galaxy and Fineart: they are already several points off from each other, and KataGo hasn’t even won any AI vs AI tournaments.
What if the distance to perfect play is much greater than we thought, and AI is as weak compared to perfect play as a human kyu player is compared to a human pro? Can we even trust the answer it gives us? You are using the estimated gap between AI and humans to extrapolate outward, and then using the estimated target to count backward again to verify the answer. It’s like seeing a mountain on the horizon with no idea how far away it is, and then seeing a guide just a few steps ahead of us. The guide can tell you very confidently how many steps he is from you, but cannot tell you how many steps it will take to reach the mountain. And we just take the distance between us, compare the guide’s height and shadow to those steps, and project that the mountain must be just a wall not far from us.
Note that as I first mentioned you can get an idea of inconsistency even without bots, just by comparing winrates 2d vs 1d to 4d vs 3d (or at other points). The math to quantify this stonewise would be harder but still possible.
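One simple way to put numbers on such a comparison is to convert each win rate into a rating gap with the standard logistic Elo formula. This is only a sketch: the EGF rating system uses its own win-probability curve, and the win rates below are invented placeholders, not measured values.

```python
import math


def elo_gap(win_rate):
    """Rating gap implied by a win rate under the standard logistic Elo model."""
    if not 0 < win_rate < 1:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400 * math.log10(win_rate / (1 - win_rate))


# Placeholder win rates of the stronger player in even games (not real data)
gap_2d_vs_1d = elo_gap(0.60)
gap_4d_vs_3d = elo_gap(0.70)

# A wider gap per rank higher up would indicate lower inconsistency there
print(round(gap_2d_vs_1d), round(gap_4d_vs_3d))
```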