Weird evaluation

I’m sure they’re both valid, and to be honest likely to be exactly the same at high playouts. You’d think that the score after playing the best move should be the same as the score in the current position, or in this case that the blue move shows a 0-point change in the score.

That said, most services can only offer limited playouts, at various pricing tiers to cover the resource costs.

So then we have to deal with inaccuracies in score estimation, which are worse in tactically complicated positions.

So then the choice is: what do you do if the evaluation of the current position is X, but after the best move is played the evaluation comes back as Y?

Do you pretend that the best move didn’t change the evaluation, or do you just show it as it is, that even the best move shifted the evaluation because the board is complicated?

I mean you could try to retroactively pull a score estimate back from as many blue moves in the future as possible, but that will only really work when players are playing a lot of blue moves to begin with.
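A minimal sketch in Python of what I mean by pulling estimates back; the scores list, the blue-move flags, and the function name are all my own invention, not any service’s actual API:

```python
def backfill_scores(scores, was_blue_move):
    """Propagate later evaluations backward along runs of blue moves.

    scores[n]        engine score estimate at position n (fixed perspective,
                     e.g. positive = good for Black).
    was_blue_move[n] True if the move leading from position n to n+1 was
                     the engine's top suggestion at position n.

    Along a run of blue moves the best move "shouldn't" change the score,
    so each earlier estimate is overwritten with the (presumably more
    settled) estimate from later in the run.
    """
    adjusted = list(scores)
    # Walk backward; wherever the played move was blue, pull the next
    # position's (possibly already backfilled) score back one step.
    for n in range(len(scores) - 2, -1, -1):
        if was_blue_move[n]:
            adjusted[n] = adjusted[n + 1]
    return adjusted
```

As soon as a player deviates from the blue move the chain breaks and the original estimates stand, which is why this only helps in games with long blue-move runs.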


Happy to also look at some examples of what we’re getting from the server and compare them to what other KataGo-based tools are saying, like AI Sensei, ZBaduk, KaTrain, etc.

Mmmm, I wouldn’t ever consider this to be a property that can be trusted to be highly reliable, no matter how many visits of search, at least not until we nearly solve the game. (And 19x19 probably won’t be solved in the foreseeable future.)

Suppose in the extreme that it were always true.

Consider running analysis on every move until the end of the game, always choosing the bot’s preferred move. For the predicted score on each move to equal the score on the next move, the predicted score must be the same on every move. So that would imply that the bot can correctly predict the end-of-game score against an equally strong opponent (i.e. itself) from any point earlier in the game. If the bot were to play itself at the komi it believes to be fair for any position, it would draw 100% of games.
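Spelling the induction out in my own notation, with $s_n$ the predicted score at position $n$ along the line where the bot always plays its preferred move, and $s_N$ the actual final score:

```latex
s_n = s_{n+1} \ \text{for all } n
\quad\Longrightarrow\quad
s_0 = s_1 = \cdots = s_N ,
```

i.e. the very first estimate would already equal the final result of self-play.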

That doesn’t quite imply that the bot is optimal (after all, it doesn’t rule out blind spots that would get an even better result than the bot’s move), but it still does mean that the bot, despite searching many moves with a bunch of visits on every turn, can’t find any improvements on its own, and at least among the moves it considers, is never surprised to find anything better or worse than expected. That’s quite beyond where current bots are on 19x19; in practice we likely won’t see this kind of behavior until bots are much closer to optimal.

Okay, sure, that’s an extreme. It can still be decently reliable for a bot that the scores on successive moves will match if you follow the “blue move”; obviously we’re not expecting that it’s literally always true.

But the game only lasts 200-300 moves, and the self-play draw rate isn’t so high yet - we would expect a good fraction of games to have at least one or two surprises and swings, and even some of the drawn games can also have surprises that cancel out.

That means that at a bare minimum we might expect, e.g., on the order of 1% (perhaps more in practice) of positions to have the bot be surprised and/or for scores to mismatch across a move… and even 1% is already enough that across all the games people are playing and analyzing every day, moves where the bot is surprised or misjudges something should be pretty commonplace for people to find. Even if high-visit searches are used.
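Rough arithmetic behind that order of magnitude (my own illustrative numbers, not a measurement):

```latex
\frac{2 \ \text{surprises per game}}{250 \ \text{moves per game}}
\approx 0.8\% \approx 1\% \ \text{of positions.}
```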

And… that’s kind of not too far from what we see, as far as I can tell. At least, it’s good to recognize that properties like reliable move-to-move score consistency can be a really big and difficult ask, far more than one might think at first.


For a perfect-play bot, we imagine it just knows the score for optimal play, and the best move(s), and has some kind of preference among equally good moves. I think this is what people tend to draw intuition from, though, in that the “best” move shouldn’t lose any points for such a perfect-play bot.

I understand your point though, that in practice, for a real bot, the same thing could happen if it just couldn’t find better moves for some reason or limitation. Better moves might exist, but it can’t find them, so it can’t know its move was a mistake and evaluate it properly.

But maybe in the ~99% (or complement of your 1%) of simpler positions that occur day to day, KataGo might find an “optimal” move with enough search.

Or at the very least, let’s call it “KataGo-optimal”, where it allows for KataGo having blind spots and imperfections. (I don’t know if it applies directly to KataGo, though, since it has some settings like wideRootNoise that might force it to explore more moves.)


I suppose though, for something like AI Sensei, ZBaduk, or OGS offering a (roughly) fixed-visits search for every move of the game, you’re getting output like the following (see the code sketch after this list):

  • score at each position of the game ~ position[n].score
  • score for each branch in the explored branches ~ position[n].branches[j].score
  • other things like the standard deviation of the score, etc.
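For concreteness, here is roughly how that view maps onto KataGo’s JSON analysis engine output. rootInfo, moveInfos, scoreLead, scoreStdev, order, visits, and pv are real fields from KataGo’s analysis-engine docs; the Position/Branch containers are my own, and I may well be off on details like which block carries the stdev:

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    move: str           # e.g. "Q16"
    score: float        # the branch's "scoreLead"
    visits: int
    pv: list[str]       # principal variation, as move strings

@dataclass
class Position:
    score: float        # root "scoreLead"
    score_stdev: float  # stdev of the top branch (a stand-in; see above)
    branches: list[Branch] = field(default_factory=list)

def parse_response(resp: dict) -> Position:
    """Reshape one parsed analysis-engine JSON response into the
    position[n]-style view used in this thread."""
    infos = sorted(resp["moveInfos"], key=lambda m: m["order"])
    return Position(
        score=resp["rootInfo"]["scoreLead"],
        score_stdev=infos[0]["scoreStdev"] if infos else 0.0,
        branches=[
            Branch(m["move"], m["scoreLead"], m["visits"], m.get("pv", []))
            for m in infos
        ],
    )
```

One wrinkle: whether scoreLead is reported from Black’s, White’s, or the side-to-move’s perspective is exactly the ± bookkeeping mentioned below, so it has to be normalized to one convention before differencing positions.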

and you have to decide what to show the user to help them decide whether they’ve made a mistake or not. There are options like:

  • If position[0].score is the empty-board score, it makes sense to me to just prescribe position[n].score - position[n-1].score as the amount lost by playing move n (except accounting for a ± sign for the two players)

  • Or position[n].score - position[n-1].branches[0].score, where you imagine the 0th branch is the best move returned from search (again accounting for ± sign). Both options are sketched in code below.
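Here is a sketch of those two options, assuming scores have already been normalized to a fixed perspective (say positive = good for Black), with sign = +1 when Black made move n and -1 when White did - that’s the ± bookkeeping from the bullets. positions and the function names are just my own notation:

```python
def loss_vs_previous_position(positions, n, sign):
    """Option 1: drop in the position score across move n,
    from the mover's perspective."""
    return sign * (positions[n - 1].score - positions[n].score)

def loss_vs_best_branch(positions, n, sign):
    """Option 2: compare the position after move n with what the
    engine's best branch from position n-1 promised."""
    best_promised = positions[n - 1].branches[0].score  # branch 0 = top move
    return sign * (best_promised - positions[n].score)
```

Note that with option 2, a blue move shows 0 points lost only if the new position’s score agrees with what the previous search’s top branch promised - which is exactly the move-to-move consistency discussed above that isn’t guaranteed.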

Maybe I’m just not familiar enough with KataGo’s raw output to know whether one should expect these to be the same, similar, or very different, and how that scales with visits and the complexity of the position.

Or whether it makes sense to try to pull back evaluations from future positions, treating them as more accurate than earlier positions (which, from the above, it sounds like is not necessarily valid in general).
