The best AI move is -7.5 here. That seems strange to me. Isn’t it supposed to be close to zero? Does that look right?
I think it typically happens in sharp positions and with low playouts.
The more playouts there are, the less likely it is to happen, but it still happens.
Essentially, at this move in the game the AI estimates a particular score by splitting its playouts across a few different move options.
Then, one move later, when it puts all of its playouts into the position after the blue move, it finds that the earlier score was off by quite a bit.
So that’s a level II review with 1000 playouts; if we run a level IV review on the game, which has 12k playouts, for comparison, the mistake seems to drop down to -4.
I think if you could run it for even longer, typically the estimate before the move and after the move would converge, and so the difference would be much closer to 0.
Ok, I understand! The low-playout analysis is the problem, though mind you, I’ve got the Hane tier with 1000 playouts.
On AI Sensei the blue circle is always “0”, and every other move is labelled relative to the blue circle.
OGS writes something less useful
Let’s bear in mind that I only have the free, low-playout tier of AI Sensei.
AI Sensei might claim that the blue move is zero, but when you look at the evaluations, you notice that the score estimate still changes:
- The score after move 58 is B+13.8,
- the blue move is played, which is claimed to be -0,
- the score after that move is played is B+15.5 (so the “0” is kind of a lie),
- the next move is claimed to be a 2-point mistake, and we see the score change to B+17.5, which makes sense.
So AI Sensei might claim that the blue move loses or gains no points, but actually it does; it’s just not being displayed to you properly.
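To make that concrete, here’s the arithmetic from the list above in a few lines of Python (the move numbers and scores are just the ones quoted; the variable names are mine):

```python
# Score estimates (positive = Black leads) after each move, from the review above.
score_after = {58: 13.8, 59: 15.5, 60: 17.5}

# The actual point swing of a move = estimate after it minus estimate before it.
blue_move_swing = score_after[59] - score_after[58]  # +1.7, despite the "0" label
mistake_swing = score_after[60] - score_after[59]    # +2.0, matching the "-2" label
```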
Edit:
I also find it very hard to parse on AI Sensei, because it looks like the move isn’t on the board yet, but the score being shown assumes that move is on the board. It’s a design decision.
I think the OGS one is more honest, so I like it
They just use different definitions, and both use KataGo.
It does not compare to previous moves; it compares the moves available in the current position.
I don’t understand what you mean.
You can see above that when a mistake is played, like move 60, it says it loses 2 points, and the score changes from B+15.5 to B+17.5, which is consistent with a move by White that loses 2 points. It also says -2 on hover.
My point is it doesn’t do this for the blue moves: whether they gain or lose points it always says 0.
If you prefer that behaviour then sure, but it doesn’t align with what is actually happening with the changes in score.
when something “loses 2 points”, it does not mean that “Black had 17 points and then Black has 15 points”
it means that if you play this other move, you will get a score of 15 on the next move
and if you play the blue move, you will get a score of 17 on the next move
(for example)
it compares two possible different futures; it does not compare past and present
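Something like this toy sketch of that definition (the numbers are made up; scores are from the perspective of the player to move):

```python
# Hypothetical score estimates for two candidate moves, both evaluated
# in the SAME current position (higher = better for the player to move).
candidates = {"blue_move": 17.0, "other_move": 15.0}

best = max(candidates.values())

# AI-Sensei-style labels: every move is compared to the best move in this
# position, so the blue move is 0 by definition.
labels = {move: score - best for move, score in candidates.items()}
# -> {"blue_move": 0.0, "other_move": -2.0}
```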
Sure, it does in theory compare everything to the best possible play.
There are a lot of situations where it does mean that, like if you could capture two stones but instead play a move worth 0, like a dame or a pass. You just lose 4 points that you could have had and should have.
You could in theory say that all the score estimates should be the same as the blue moves’ estimates. That’s probably true at really high playouts, but it’s not always the case.
Let’s say we’re in position A, and most, but not all, of the playouts go into the blue move. Some small weight will be given to the score estimates of the non-blue moves, and that can affect the output.
Then in position B, where the blue move from position A has been played, all the variations and playouts are spent evaluating only the branches in which that blue move has been played. In theory the search is now looking one move deeper into the variations, so it can be more accurate than it was one move before (see the sketch below).
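A rough sketch of that effect (the numbers are invented, and a plain visit-weighted average is a simplification of KataGo’s actual value backup):

```python
# Hypothetical search tree at position A: visits and score estimates per
# candidate move, from the player to move's perspective.
children = [
    {"move": "blue", "visits": 900, "score": 15.5},
    {"move": "alt1", "visits": 80,  "score": 12.0},
    {"move": "alt2", "visits": 20,  "score": 10.0},
]

# The root's estimate blends every branch, weighted by visits, so the
# weaker branches drag it below the blue move's own 15.5...
root_score = (sum(c["visits"] * c["score"] for c in children)
              / sum(c["visits"] for c in children))  # ~15.1

# ...whereas once the blue move is actually played, all 1000 playouts go
# into that branch alone, one move deeper, and the estimate can shift again.
```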
That depends, I think, on whether you imagine analysing a position statically vs trying to analyse a whole game.
If a game has been played, there is a past and a future that you can compare to. Those can contain blue and non-blue moves.
If the game has no past or future to compare to, then sure, the only thing you can really take abstractly is the score of the position, which is going to be weighted more toward the blue move’s score.
But that’s only true with enough playouts, where the score of the blue move in the current position doesn’t change when you switch to analysing the position in which the blue move has been played.
The silly example is a ladder when the engine can only see a limited distance ahead (which maybe doesn’t apply as much to KataGo, but let’s say Leela Zero etc.). If you can’t see the end of the ladder, you might think the best move is to escape; but one move later, when you can see the end of the ladder, you realise you can’t escape. So escaping on the previous move was bad, and you only realise it one move later.
At least that would be my understanding of it.
The most dramatic example might be in really complicated games like Lee Sedol’s broken ladder game.
Let’s again use a low-playout KataGo, like the free tier of AI Sensei: AI Sensei | Hong Jansik vs Lee Sedol
From move 75 to move 96 they’re all blue moves, yet:
the score changes from B+9.4 to B+35.6 with only blue moves.
With low playouts it simply makes no sense to say both that every blue move loses or gains 0 points and, at the same time, that the score estimate has changed by over 26 points.
That’s what I’m trying to get across. Playouts are limited because AI reviews need server time, and running servers costs money.
The effect of limited playouts is that the score estimate isn’t always accurate, so trusting the blue move to be a 0-point loss doesn’t make sense if you can’t even trust the score estimate in the first place.

if we play this bad move and then the next AI move, the score is:
if we instead play the AI move and then the next AI move, the score is:
43.8 - 16.8 = 27
that is where it comes from
If you can link your AI Sensei review, it would help, so that we work with the same numbers if we want to focus on that game.
But let’s say I look at your example with move 74. For me, on the free version, move 74 is a -26.6 mistake.
If you want to know where that -26.6 comes from, simply check one move before.
The score was B+17.5 after Black’s AI (blue) move 73. I showed above that the tooltip/hover explains this.
Then White makes a 26.6-point mistake with move 74, which brings the score from B+17.5 to B+44.1, the score after White’s move 74.
44.1 - 17.5 = 26.6
It’s genuinely the difference in score estimate before and after the move.
Now you might argue that it should be approximately the same as something like your two-futures calculation above, or something more convoluted, but
A) it’s not even as close - you’re off by 0.2 in your example, and
B) in positions where the blue moves lose really close to 0 points, you can play long strings of blue moves without changing the score and pretend that that’s the move or method you should compare against.
Anyway, I think AI Sensei’s design choice was bad, because it’s not at all easy to interpret as it stands.
Anyway, I want to stress that:
It is possible that the way the AI Sensei code computes the values it displays is something like
game move score estimate - blue move score estimate
and that’s why the blue move is always zero. I can’t say that that’s not what’s happening.
But I think that if you can’t guarantee the blue move gains/loses 0 points, then it doesn’t really make sense to say it loses 0, even though you can choose to display it that way.
that is what I meant from the beginning
Sure, but I’m suggesting that, given the many examples above, it isn’t more useful to write that the blue move loses zero and then, at low playouts, have the score change all the time when the blue move is played.
I think it’s depressingly ironic that we went from not knowing what we were doing to using these fantastic new tools - but without knowing what they are doing or what these evaluations mean.
I mean we know what they mean.
KataGo explains what its parameters mean (to a certain extent) here.
For example:
scoreMean
- Same as scoreLead. “Mean” is a slight misnomer, but this field exists to preserve compatibility with existing tools.

scoreLead
- The predicted average number of points that the current side is leading by (with this many points fewer, it would be an even game).
and the scoreLead is a mixture of KataGo’s neural net prediction (which has learned to estimate who is winning and by how much) and the results of its tree search, combined as some kind of weighted average.
I mean I don’t understand the fine details and the code, but you can have a rough idea what’s happening.
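For instance, you can query KataGo’s analysis engine for scoreLead yourself. A minimal sketch in Python (the binary, model, and config paths are placeholders; the JSON fields come from the analysis engine protocol):

```python
import json
import subprocess

# Placeholder paths -- point these at your own KataGo binary, model, and config.
KATAGO = ["katago", "analysis", "-model", "model.bin.gz", "-config", "analysis.cfg"]

def score_leads(moves, visits=1000):
    """Return KataGo's scoreLead estimate after every move of a game."""
    proc = subprocess.Popen(KATAGO, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    query = {
        "id": "demo",
        "moves": moves,  # e.g. [["B", "Q16"], ["W", "D4"], ...]
        "rules": "japanese",
        "komi": 6.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": list(range(len(moves) + 1)),
        "maxVisits": visits,
    }
    proc.stdin.write(json.dumps(query) + "\n")
    proc.stdin.flush()
    leads = {}
    for _ in range(len(moves) + 1):
        # Responses can arrive out of order, so index by turnNumber.
        response = json.loads(proc.stdout.readline())
        # Whose perspective scoreLead is reported from depends on the
        # reportAnalysisWinratesAs setting in the analysis config.
        leads[response["turnNumber"]] = response["rootInfo"]["scoreLead"]
    proc.stdin.close()
    proc.wait()
    return leads
```

Comparing consecutive leads across a blue move at two different maxVisits values is essentially the low- vs high-playout experiment discussed above.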
We know points are points, and in some simple cases it’s relatively easy to see why you might lose a point or two here or there.
In other cases it’s much harder to see exactly why one move is 3 points better, but we wouldn’t have any good method of estimating that ourselves anyway.
On the other hand, when someone makes a closed-source app, unless they explain exactly what’s happening in the background, of course we don’t know exactly what they’re doing; we just have to take a best guess.
I think both approaches are valid.
Considering the last move’s score-lead prediction as 0 and showing how the suggested next moves change it tells the user more about why the lead prediction is as high as it is. That is important for judging the whole-board situation and how comfortable one player’s lead is.
Considering the best next move’s score lead prediction as 0 and showing how other suggested moves compare to that focuses on where to play next and how bad it is to deviate from the move KataGo ranks highest.
Another approach would be to show the predicted score lead just as KataGo returns it: if you play here, you’re X points ahead/behind. That might make moves a bit harder to compare when one player has a big lead, because more digits are shown on the screen, which is harder for our brains to process.
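All three conventions can be derived from the same KataGo response; roughly like this (the moveInfos/rootInfo field names are KataGo’s, the function and everything else is my sketch):

```python
def annotate(prev_root_lead, root_lead, move_infos):
    """Three ways of labelling the same position from one KataGo response.

    prev_root_lead -- rootInfo["scoreLead"] before the last move was played
    root_lead      -- rootInfo["scoreLead"] in the current position
    move_infos     -- KataGo's moveInfos list for the current position
    """
    # KataGo's top-ranked (blue) move is the one with order == 0.
    best_lead = min(move_infos, key=lambda m: m["order"])["scoreLead"]

    return {
        # 1. The last move relative to the previous estimate (OGS-style).
        "last_move_delta": root_lead - prev_root_lead,
        # 2. Each candidate relative to the best move (AI-Sensei-style):
        #    the blue move is 0 by construction.
        "losses": {m["move"]: m["scoreLead"] - best_lead for m in move_infos},
        # 3. The raw lead, just as KataGo returns it.
        "raw_lead": root_lead,
    }
```

Which of the three a review tool displays is purely a UI choice; the disagreement in this thread is really about which choice degrades most gracefully at low playouts.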