On OGS vs IGS rankings

That giant landscape would be very complicated, but with perfect knowledge of the game, it would be possible to assign a definite numerical minimax score to each point (game position) in the landscape and use that as the vertical elevation of each point.

I don’t think of this landscape as having a central winning peak that you can just move towards. You are not alone on the go board. One player (min) tries to go in a direction that makes the trajectory go downhill to a nearby valley, while the other (max) tries to go in a direction that makes the trajectory go uphill to a nearby mountain. But when neither player makes a mistake (going in a direction that helps the opponent), all they can really hope for is to keep the trajectory at the same elevation.
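As a minimal sketch of that min/max tug-of-war in Python (the position interface with `is_over`, `score`, `legal_moves` and `play` is hypothetical, just for illustration, not any real library):

```python
def minimax(position, maximizing):
    """Minimax score of a position: max always climbs the landscape,
    min always descends, so the returned elevation is what results
    when neither player ever helps the opponent."""
    if position.is_over():
        return position.score()  # final score, from max's perspective
    scores = [minimax(position.play(move), not maximizing)
              for move in position.legal_moves()]
    return max(scores) if maximizing else min(scores)
```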

The reason we can’t actually do this landscape mapping is not a problem of principle. It is a technical problem: the universe is not big enough to contain this much data.

But we may get arbitrarily close with bigger and bigger computers in future eons. (In fact, current chess engines can already play perfect endgames without any reading, by looking up precalculated minimax scores from a large database of board configurations with up to 7 pieces on the board.)

Let’s assume for now that we have an oracle with access to a universe big enough for all that data, that can provide us with the minimax score for any game position we ask it about. With such an oracle available, making a perfect player would be a piece of cake.

When one of the players in a game is perfect, the minimax score would never change as a result of their moves (they don’t make mistakes). When both players are perfect (and komi is perfect), the game trajectory never deviates from elevation 0 and the game always ends in jigo.

A game between imperfect players would also follow some trajectory through that giant landscape, but the elevation would vary above and below 0 along the trajectory, as the lead of the game shifts back and forth between the players, because of their mistakes.

A game trajectory can be projected onto a 2-dimensional graph with the minimax score on the vertical axis and the move number on the horizontal axis (like the OGS AI review graph, except that this perfect review graph would be perfectly accurate, while an AI review graph is only an approximation and needs some error bars).

From such a perfect review graph you can extract the point loss from each move, and by some statistical aggregation of that data (such as a simple average per player), you can reduce that 2-dimensional review graph to one number for each player.
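A minimal sketch of that reduction, assuming we already have the perfect per-move minimax scores (taken from Black’s perspective here, with Black playing the odd-numbered moves):

```python
def average_point_loss(scores):
    """Reduce a review graph to one number per player.

    scores[i] is the minimax score (Black's perspective) after move i;
    scores[0] is the starting position. With perfect scores, a move can
    only keep the elevation or shift it toward the opponent, so each
    drop (for Black) or rise (for White) is that move's point loss.
    """
    losses = {"black": [], "white": []}
    for i in range(1, len(scores)):
        delta = scores[i] - scores[i - 1]
        player = "black" if i % 2 == 1 else "white"  # Black plays odd moves
        loss = -delta if player == "black" else delta
        losses[player].append(max(0.0, loss))
    # simple aggregation: average loss per move, per player
    return {p: sum(l) / len(l) if l else 0.0 for p, l in losses.items()}
```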

Regular rating systems reduce this data even further. The game trajectory through the landscape is totally ignored, except for the game result. Regular rating systems are agnostic about this landscape and even about which game is being played (go, chess, tennis, rock-paper-scissors,…)

I don’t really understand what you mean by updating / rearranging the landscape. The shape of the landscape is fully fixed. The players are only creating the route of their game through the landscape by the moves they play.
Handicap games have a different starting point than even games (the board position is not empty and the start elevation is not 0). But they are still just board positions with a fixed minimax score, like any other position.

1 Like

I suppose that positions exist where more than one move has the optimal minimax score. Annotating each position in the game tree with additional attributes (such as downstream complexity, narrowness of the optimal path for the opponent, proximity to heavily advantageous game tree sections) could make the perfect player stronger in (handicap) games against imperfect players.
But as you’d add more and more optimizations like these, I think the definition of a perfect player and the distinction between a “go god” and a “go devil” becomes less and less clear.

I prefer to use the simplest, most minimalistic perfect player as a reference point: one that decides by looking only at the minimax scores of the branches leading directly out of the current position, and when multiple moves in a given position have the same optimal minimax score, chooses randomly between them. It’s a pity that such a perfect player wouldn’t be deterministic anymore, but as it’s hypothetical anyway (in our underpowered universe at least), that is probably a moot point.
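Such a minimal perfect player fits in a few lines, given a hypothetical `oracle(position)` that returns the minimax score (from Black’s perspective here, with a `maximizing` flag for the side to move):

```python
import random

def perfect_move(position, oracle, maximizing):
    """Choose uniformly at random among the moves whose resulting
    positions share the optimal minimax score (oracle scores are
    from Black's perspective; maximizing=True when Black moves)."""
    scored = [(oracle(position.play(move)), move)
              for move in position.legal_moves()]
    best = max(s for s, _ in scored) if maximizing else min(s for s, _ in scored)
    return random.choice([move for s, move in scored if s == best])
```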

Aren’t there already 5-10k bots playing on OGS? Or do you mean something else than those?

I don’t follow you here. We are talking about uncertainty CHANGES (in actual playing strength, not rating) between neighboring ranks, not uncertainty itself. And a 50% change would mean only 1 more stone to perfect play (1 stone took half the uncertainty, the next stone will take the other half and you’re at perfect play already).

I also don’t see where 50% comes from. If you mean the possibility that a player’s rating is nearly one stone off, that doesn’t matter, since we only consider the statistical average of several players; individual inaccuracies don’t matter much.

Random selection among the best-scoring moves is a good idea, but probably still extremely non-optimal in losing positions against humans. Can you possibly win handicap games against SDKs at more than 4-6 stones without doing anything even a little bit risky?

I don’t know, is there a list of bots somewhere? This one seems pretty close since the software version is specified clearly:

But look at its crazy rating fluctuations!

I wonder if the CPU is getting overwhelmed or people are exploiting weaknesses. Perhaps it would be better to limit the number of ranked games it plays against any given opponent.

1 Like

I wonder if mixing different board sizes and time settings in a single rank is as “ok” for bots as it is for humans.

Humans will use the extra time given in correspondence games, for example, and maybe bots are much better than humans at raw calculation (on smaller boards) at any given level.

Human players are not perfect players. The accuracy of their knowledge, reading and evaluation is always limited, so humans need to take some chances every now and then (more so when behind and less so when ahead).

On the playing style scale from “go god” to “go devil”, humans are somewhere in between. Maybe more like a “go god” (playing proper and reasonable moves in any position) in a teaching game, and more like a “go devil” (using every dirty trick in the book) when blitzing against a rival.

As for how I play in higher handicap games against SDKs, I tend to think I play fairly calmly, and many games end up close enough to go to scoring.
I cannot claim to take zero risk, because I’m far from a perfect player, so taking some risks is unavoidable. Also, in even games against opponents of my level, I take risks when my positional judgement tells me I need to in order to avoid losing. It doesn’t matter that I cannot read it all out, when not taking the risk means I’ll surely lose by points.

If a human player takes no risks at all, their play is usually too passive, which is usually not good enough to win against an opponent of similar skill.

It’s probably not very relevant for this general topic, but maybe you’d like to judge for yourself my claim that I tend to play fairly calmly when giving higher handicaps to SDK players. Here are two correspondence handicap games I played against OGS 5k:
7 stones Tournament Game: Test your handicap skills! (Round Robin version, Invitational) (74824) R:1 (gennan vs yebellz) (I lost by 10.5 points),
6 stones Tournament Game: Test your handicap skills! (Round Robin version, Invitational) (74824) R:1 (gennan vs teapoweredrobot) (I won by 19.5 points)

1 Like

Ok, what do you mean by “uncertainty CHANGES”? As in my original question: what exactly is the number in the denominator, and against what number, for you to get 10%? You just keep replacing one concept with the next: stone difference, rating, now actual strength, etc. Please be specific about where exactly we can get numbers for these “things”, or this discussion will go nowhere.

I can only repeat myself:

Those are two possible candidates for “inconsistency”, but others may be possible as well. The only reason win rates against a 1-stone-weaker opponent are not directly usable is that those numbers would first need to be transformed to the stone or point scale.

(From a less numeric view, inconsistency is how much better a player plays on a good day (or in good games) than on a bad day. These differences are known to decrease proportionally with increasing strength.)
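As a sketch of that transformation to the stone or point scale (the logistic model and both constants are assumptions for illustration, not established values):

```python
import math

ELO_PER_STONE = 100.0    # assumed rating-gap equivalent of one stone
POINTS_PER_STONE = 13.0  # rough KataGo-based value quoted later in the thread

def winrate_to_stones(p):
    """Invert an Elo-style logistic curve p = 1 / (1 + 10**(-gap/400))
    to get a rating gap, then express that gap in handicap stones."""
    gap = 400.0 * math.log10(p / (1.0 - p))
    return gap / ELO_PER_STONE

def winrate_to_points(p):
    return winrate_to_stones(p) * POINTS_PER_STONE

print(winrate_to_stones(0.64))  # ~1 stone, under these assumptions
```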

I think you misunderstood about “the winning peak”: the “winning regions” are everywhere in the landscape, and since it’s purely theoretical, the size of the observable universe has nothing to do with it.

In the original state space landscape, if you define the link from one legal move to the next as a unit edge, positions can be placed in a hypercube-like fashion and fixed. However, as in topology and graph theory, the graph can be morphed (stretched) without losing the connectivity edges between nodes. When a higher-dimensional object is projected to a lower dimension for the rating system to work on (winning/losing, score, stone difference, all the parameters in a rating system), then for a linear rating to work in that parameter space, it is just as I described: a projection. And since the rating of every player is recalculated after every game, a new projection needs to be made each time. If we just observed the parameter space, it would look as if the landscape morphed. The rating system doesn’t need to encompass the whole web of the landscape; it just includes the positions from everyone in the history of games used in the rating system, a partial graph that is only a tiny fraction of the sparse whole state space.

What I mean about the checkpoints is that they are the players themselves, like beacons across the landscape, outside the nodes so to speak. Think of a person’s record as a personal partial graph, encompassing only the positions they have played that are used in (and relevant to) the current rating calculation. The final positions are marked as wins or losses, and the person has a rating score. Any point on the web is an aggregate of the players who traversed it, like ants leaving pheromone as they pass along a trail, carrying their “rating” toward the end positions of that person’s historical games (not necessarily in settled winning regions, just the positions in the rating system’s record). If we did that, and forced every player to have a rating number, it would impose a unit (a fixed direct distance between the players in the parameter space). And the personal webs of nodes would have to lie on the boundaries between players’ checkpoints. In a sense, it is a clustering problem: each player drifts in the original state space like a cluster center in their own web. If we added the parameter space back to the state space, it would be like assigning extra rating values in the state space, with the players as anchors carrying numbers, and each position getting a pheromone value by being associated with those players.

So a more well-defined final question is: if we just extrapolate the tiny fraction of the web used in the rating system out to the whole full map (practicality doesn’t matter), what would it look like? Would the extrapolated values outward be trustworthy? Since it is a dynamic process, every time a new game is added to update the rating system, how much roughness will the update introduce in the extrapolated outward region? If we had AIs do some of the work of exploring at least the nearby regions for us, then we could test how the outward extrapolation actually matches the AIs’ estimations. Checking their estimations against one another would also be of great help: if we can subject different AIs (not just the same AI against itself), as different players with their own versions of personal webs, to the positions in their own historical game records, we can then see how their variation measures up against one another, and how reliable the extrapolation from our rating system really is.

BTW, if we line up the player beacons in a circle in ascending order, and the web is projected onto a 2D plane, then it could be arranged like a wizard-hat shape. The game nodes would be spread among the player beacons, and the entrance node wouldn’t be on the outskirts, but in the middle, halfway up the slope, near the weighted-average player’s rating. And winning regions can be on either side, not just moving inward. Imagine an end position from two very low-rated players: it would reside far outward, but the trail of the game starts in the middle and moves outward. The center region would be left mostly blank and unknown, as no one is really sure what the game would look like near there (not even AI has the strength to explore it yet).

I mostly play teaching games or handicap games with reverse komi; it’s more general and can teach all aspects of the game from fuseki to yose. And I’d say the best strategy is probably just playing it as a normal game, fighting as hard as I can, maybe with more emphasis on influence and higher positions. Although, I feel it might have contributed a little to my hyper-aggressive, forcing fighting style in the mid-game.

@gennan @jannn
I do wonder, though, whether it is generally believed that with reverse komi one stone equals about 6-7 points of extrapolated score difference? It is what I believe, but are there enough AI reverse komi games to confirm this? If there is a high correlation of score differences, reverse komi might be the key to linking the elusive traditional idea of handicap stones to what AI believes the estimated final scores to be, and thus add an extra dimension to the rating system, not just win/loss.

I believe there is a rating system that takes score difference into consideration: Point Ranking Scheme At Tokyo Go clubs at Sensei's Library, a combination of fine-grained komi with handicap stones. I wonder if it can be applied to reverse komi as well? Would it give a more rational and faster-converging rating?
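I don’t know the exact formulas of that scheme, but a generic sketch of a rating update driven by score difference instead of only win/loss might look like this (the expected-margin model and both constants are made up for illustration):

```python
POINTS_PER_RANK = 13.0  # rough points-per-rank conversion (see later posts)
K = 0.05                # arbitrary learning rate, in ranks

def update_ratings(r_a, r_b, margin):
    """Update two ratings from the final score margin (positive when
    player A wins by that many points, under fair komi/handicap).

    The expected margin is the rating gap expressed in points; the
    update shifts both ratings by the 'surprise', rescaled to ranks."""
    expected = (r_a - r_b) * POINTS_PER_RANK
    surprise = (margin - expected) / POINTS_PER_RANK
    return r_a + K * surprise, r_b - K * surprise
```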

:thinking:

This could be an issue though:

Since strong bots are 9d or 9d+, that task will be up to dans and 1kyus, if they’re willing to…
I’m out! :grin:

1 Like

KataGo gives estimates of komi value for different stone handicaps. See Handicap Games - #8 by gennan (those estimates were made about 18 months ago, newer KataGo versions may give slightly different results). So stone handicap and komi handicap are more or less interchangeable for AI (although individual players may differ in their ability to make good use of handicap in the form of stones vs handicap in the form of komi).

So to quite a good approximation, handicap stones are worth ~13 points according to KataGo (except for the first one, which is worth ~6.5 points in traditional handicap). That is with Japanese rules.

Under Chinese rules a handicap stone may be worth ~14 points (except for the first one, which is worth ~7 points), but I didn’t test this with KataGo.

This equivalence between score, komi and handicap means that (to a good approximation) komi and handicap readily fit into the minimax score annotation of each board position in the game tree, avoiding the need for extra dimensions in the landscape. Three dimensions are probably enough for the annotated game tree landscape: 2 horizontal dimensions to spatially separate positions from each other and 1 dimension (elevation) for the minimax score of each position (which would include komi and handicap).
You could even collapse the 2 horizontal dimensions to 1 horizontal dimension by using some hash coding scheme (which is how transposition tables in chess engines work).
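As a sketch of such a hash coding scheme (Zobrist hashing, the standard technique behind transposition tables; the sizes and seed here are illustrative):

```python
import random

BOARD_POINTS = 19 * 19  # one random table entry per point and stone color

random.seed(0)  # fixed seed so every run assigns the same keys
ZOBRIST = [[random.getrandbits(64) for _ in range(3)]  # 0 empty, 1 black, 2 white
           for _ in range(BOARD_POINTS)]

def position_hash(board):
    """Collapse a board (list of 361 values in {0, 1, 2}) to one 64-bit
    key: a single 'horizontal' coordinate for the landscape."""
    h = 0
    for point, state in enumerate(board):
        if state != 0:  # conventionally only occupied points are hashed
            h ^= ZOBRIST[point][state]
    return h
```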

With this equivalence between komi and handicap, you could make handicaps much more fine-grained with increments as small as 0.5 points, instead of (or in addition to) the large 13-14 point increments of full handicap stones.
Especially at the highest levels of play, you probably need to use high-resolution handicaps to accurately measure rank gaps between players (by determining the handicap required to achieve a ~50% win rate over a long series of games). This komi method would give a resolution of 3.6%-3.8% (1/28-1/26) of a rank for (komi) handicap increments of 0.5 points.
You could even achieve an arbitrarily high resolution by determining the fraction of games of adjacent (komi) handicaps needed to achieve 50.000…% win rate over a long series of games.
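The arithmetic behind those resolution numbers, plus the fractional-mixing idea, as a small sketch (using the ~13-14 points per rank from above):

```python
def rank_resolution(points_per_rank, komi_increment=0.5):
    """Fraction of a rank resolved by one komi increment."""
    return komi_increment / points_per_rank

print(rank_resolution(13.0))  # 1/26 ≈ 0.038, i.e. ~3.8% of a rank
print(rank_resolution(14.0))  # 1/28 ≈ 0.036, i.e. ~3.6% of a rank

def komi_mix(target_points, komi_increment=0.5):
    """Arbitrary resolution: split a long series between the two komi
    values adjacent to the target, returning (low komi, high komi,
    fraction of games at the high komi) so the average hits the target."""
    steps = target_points / komi_increment
    low = int(steps)
    return low * komi_increment, (low + 1) * komi_increment, steps - low

print(komi_mix(3.7))  # ≈ (3.5, 4.0, 0.4): play ~40% of games at komi 4.0
```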

1 Like

I just had an unranked 9 stones handicap game against katago-micro.
I won.

So one of the following must be true:

  • I’m dan
  • it isn’t 9d
  • 1 rank to 1 stone conversion doesn’t work
1 Like

I think katago-micro plays ranked games against other bots which don’t play well with handicap, so katago-micro is probably not 9d.

1 Like

It probably makes sense. I doubt the 7d bots are really 7d either; probably the 6d and 7d humans don’t play them.

So they might keep going up in rank until a stronger bot like golaxy beats them back down :stuck_out_tongue:

1 Like

The discussion about the state space and the projected landscape probably deserves a separate thread. I think transforming the perfect play problem into finding the shortest route through a graph in projected planes is a tool that theoretical analysis hasn’t utilized enough yet, since the trees in any tree search are all just part of the full state space web graph.

And my understanding of how AI networks work is that they are just complex matrix multiplications, which in theory is exactly projection: mapping into higher dimensions and then projecting back down to lower dimensions in parameter space (predicted score difference, predicted winning probability, etc.). But there is no reason we couldn’t bypass the mapping-into-higher-dimensions part done inside black-box AIs, and just work on projecting the full graph into lower dimensions first (not into parameter space, but keeping the equivalent topography) and operating on that instead. (Like real-world mapping: traditional map projections already do a good job of calculating distances and finding shortest paths on a curved, uneven surface in a local area, without needing GPS to get the details down to meters.)

I find the growing number of players (and people trying to analyze games) who blindly trust the output of AIs unconditionally quite troublesome. These parameters and measurements may have correlations, but they are not likely to be such simple linear correlations. Not to mention that komi is a parameter that literally changes the winning/losing positions and regions as part of the rule set. We should probably put komi aside and deal with a measurable quantity like final scores first. Using AI-projected scores/komi as part of the analysis is the issue I have repeatedly tried to bring up: get a better look at, and understanding of, the measuring tools before using them to measure further and reaching early conclusions.

About handicap stones and scores, we can use an almost-solved board size like 7x7 to check what the assumed linear correlation between scores and handicap stones would look like. From what I read, most now agree the final score difference under perfect play on 7x7 is 9 (using area scoring, with Black getting the final dame). But what about 2 handicap stones? My guess is 28 or 29, since the minimum living space in the corner or on the side is about 10 points of area: if Black occupies both 3-4 points in the center, (49-10)-10 = 29 (White might have a chance to get the final dame, so it could be 28). And for 3 handicap stones it is pretty clear it would be 49: White has no chance of living. So “fair games” at 1, 2, 3 handicaps would end up at 9, 28(29), 49.

This simple analysis gives a very clear clue that handicap stones are definitely not linearly correlated with score: the higher the handicap, the closer we get to the point where White has no space to live under perfect play, and just before that point, the result is determined more by the minimum size of a living White group than by the size of Black’s living groups, and by how the board is split. Hence the handicap-to-final-score relationship is a curve bending gradually toward no living space. And with players of unequal strength, we need to be even more careful about jumping to the conclusion that strength can be equated with handicap stones. As I said, if we could analyze millions of games between different AIs with all types of free handicaps and their final scores (get out of the car and check the ground markings), by forcing them to play to scoring, then we might have more confidence in measuring their “strength” and how consistent they really are.
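The arithmetic from that analysis, written out (the 10-point minimum living space is the assumption above, not an established value):

```python
BOARD = 7 * 7           # 49 points, area scoring
MIN_LIVING_SPACE = 10   # assumed minimum area for a surviving White group

h1 = 9                             # 1 handicap: accepted perfect-play result
h2 = (BOARD - MIN_LIVING_SPACE) - MIN_LIVING_SPACE  # 29 (28 if White
                                                    # gets the final dame)
h3 = BOARD                         # 3 handicaps: White cannot live at all

# beyond 3 stones the margin saturates at 49 (the whole board),
# so the handicap-to-score curve necessarily flattens out
print(h1, h2, h3)  # 9 29 49, under these assumptions
```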

2 Likes