Top 3 moves: Score a better metric?

I mean, I could probably come up with some examples where highlighting a 10-point-ish mistake and the AI variation won’t be helpful, but I imagine that, for the same game, it won’t be easy to conjure up a simple metric on winrates and scores that finds better key moves (by simple I mean things like a first difference, as opposed to “weight the last ten moves in such a way, and consider applying some nonlinear function to each score”).

Right, this is the kind of thing that a human teacher can do that we can’t (yet) substitute with an AI :smiley:

But a human without a teacher can still get better at interpreting and using the AI in a productive way. When a move is -1 point, sometimes there’s an easy to understand reason why, and you can figure it out by yourself or by looking at some different continuations from the AI (and if it’s a common shape you’ve just learned something very useful). That’s the kind of thing that you will totally miss if you only look at the “biggest” mistakes, and reason that smaller stuff doesn’t matter at your level (I’m definitely guilty of this myself, by the way - thinking for yourself is hard work, so it’s very tempting to take the lazy route of just staring at point losses).

Anyways, the best way to review with AI is a different topic. I basically just wanted to say “point difference is a better metric for key moves than winrate” and then I felt compelled to clarify that using point difference isn’t perfect, but it’s about the best we can do with current technology. Looking at the biggest point losses throughout the game is certainly a good starting point when reviewing, so this feature would be very much appreciated.


I also agree with you that there are probably way more moves with smaller point losses where it will be easier to find something to improve on. The problem is that, out of the likely 100 1-point losses (or possibly as many 2-point losses, etc.), we don’t know which one to pick for it to actually be helpful for the player to learn from.

It’s just that if one is only giving the top 3 or top 6 or top 9 game-changing moments in some way (and only those variations for non-supporters), then I’d prefer it did that (a) for the biggest score shifts even if they don’t change the winrate (for the reasons mentioned before, they’re probably the game-changing mistakes rather than the 1-point mistake that just tipped the score from 0.5 to -0.5) and (b) ideally with an equal number for each player.

If someone gets hammered likely they’ll have all the big score losses, but that’s not to say the winning player didn’t make mistakes or have something they could learn also.
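As a rough illustration of (a) and (b), here is a minimal Python sketch that takes first differences of the score estimate and keeps the biggest point losses for each player separately. Everything in it is an assumption for illustration: the `black_lead` list (Black's estimated lead after each move, with index 0 being the position before the first move), the function names, and the rule that colors simply alternate starting with Black, which would not hold as-is for handicap games.

```python
def point_losses(black_lead):
    """Turn a list of Black's estimated leads into (move_number, color, points_lost)."""
    losses = []
    for n in range(1, len(black_lead)):
        change = black_lead[n] - black_lead[n - 1]
        color = "B" if n % 2 == 1 else "W"  # assumption: Black plays odd moves
        # A drop in Black's lead is a loss for Black; a rise is a loss for White.
        lost = -change if color == "B" else change
        losses.append((n, color, max(lost, 0.0)))
    return losses


def top_per_player(losses, per_player=3):
    """Keep the biggest point losses for each color separately."""
    picked = {}
    for color in ("B", "W"):
        own = [m for m in losses if m[1] == color]
        picked[color] = sorted(own, key=lambda m: m[2], reverse=True)[:per_player]
    return picked


# Made-up score estimates, not taken from any real game.
example_lead = [0.0, -1.0, 2.0, -6.0, -5.5, -5.5]
print(top_per_player(point_losses(example_lead), per_player=1))
```

With this split, a lopsided game still surfaces something for the winner to look at, even if the loser made all of the largest mistakes.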

Maybe there’s other suggestions though on whether there should be some other subdivisions that could be interesting? Like guarantee a mistake from the first 50 moves, or last 50 etc?


Many thanks! Examples are exactly the thing I need!

I will note though… this is a handicap game, so it still supports my hypothesis that Δ(score) is most helpful in handicap games.

EDIT: But I didn’t have to go back too much farther to find another good example: qnpnpmqppnp vs. kJames :grinning: lots of swings in these games!

Duh… Sorry - But as you noted it is a very common occurrence in my games, so examples are easy to find.

but it’s about the best we can do with current technology.

I doubt this is the case. You can do better already with only the information already available. It just needs one person who has a bit of practical coding knowledge to write a script to auto-annotate some games according to some possible formulas, and a few strong kyu and dan players with the willingness and time to extensively playtest a variety of formulas and see what produces the best results in practice.

For example, I suspect that something roughly like this would be better (or similar formulas - this is where the playtesting comes in):

Let P be the move the player made. For each move M that got at least 15% of the number of visits that the AI favorite move did, compute:

  • (score(P)-score(M)) * (raw policy of M) * (1 - raw policy of P)

Take the minimum of all values computed. Use this in place of Δ(score) for the purpose of sorting and identifying mistakes. For reference Δ(score) would simply be score(P)-score(M) for the single top favored M.


  • Multiplying by raw policy of M: a mistake is more review-worthy if the bot considered the correct move to be “obvious” (high raw policy) without any reading than if even the bot itself didn’t think of the correct move at first (low raw policy) and only found it was correct after careful reading. Even though the raw policy is still going to far outstrip the typical kyu or dan player, and the things it considers obvious may still include lots of things that are not easily learnable for humans, more of the things that are easily learnable for amateur humans are going to be among the things that the bot considers obvious without reading.

  • Taking the minimum over all M that were searched at least 15% as much: this is so that multiplying by raw policy of M doesn’t overly downweight a move if the bot’s favorite move was low policy but there was a move that was almost as good that was very high policy.

  • Multiplying by (1 - raw policy of P): a mistake is more review-worthy if the bot considered it obvious that the played move was not even a possibility than if even the bot thought at first that the played move was feasible.
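To make the formula concrete, here is a minimal Python sketch. The `Candidate` record and its field names are made up for illustration and are not KataGo's actual output format. It also assumes that the "AI favorite" is the most-visited move and that scores are from the perspective of the player to move.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    move: str      # board coordinate, e.g. "Q16"
    score: float   # estimated score lead for the player to move after this move
    policy: float  # raw policy prior in [0, 1]
    visits: int    # search visits spent on this move


def review_worthiness(played, candidates, visit_fraction=0.15):
    """Minimum of (score(P) - score(M)) * policy(M) * (1 - policy(P)) over well-searched M.

    More negative means more review-worthy.
    """
    top_visits = max(c.visits for c in candidates)  # assumption: favorite = most visits
    eligible = [c for c in candidates if c.visits >= visit_fraction * top_visits]
    return min(
        (played.score - c.score) * c.policy * (1.0 - played.policy)
        for c in eligible
    )


# Made-up numbers: the favorite "A" is low policy, but "B" is almost as good and "obvious".
played = Candidate("P", score=-5.0, policy=0.1, visits=20)
candidates = [
    Candidate("A", score=4.0, policy=0.05, visits=1000),
    Candidate("B", score=3.5, policy=0.6, visits=400),
    Candidate("C", score=5.0, policy=0.3, visits=100),  # below 15% of 1000 visits, ignored
]
print(review_worthiness(played, candidates))
```

In this example the value comes from move B rather than from the favorite A, which is the effect the minimum over well-searched moves is meant to have.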

Other things you might want to try:

  • Other possible variants of the above. For example if multiplying by raw policy seems too swingy on the metric in practice, multiply by the square root of the raw policy instead.
  • Multiply by 10 / (|current point lead| + 10): for non handicap most players probably are interested more in the mistakes that won or lost the game, it’s just that winrate is too sharp. But you can construct your own less-sharp function that cares more about score changes when the lead is closer to 0 than when the game is totally lopsided. Change 10 to some other number to vary the sharpness.
  • Maybe also use the reported scoreStdev to scale the sharpness: similar justification. This number conveniently is high in the opening and low in the endgame. Again, maybe in a human game the score swings way more than in KataGo’s games. But that doesn’t stop you from just multiplying this number by 2 or something and using that, which is better than using nothing at all.
  • For these formulas, try pulling the raw policies out of an older and smaller network whose understanding may be closer to amateur level, while still using the evaluations (more accurate) from the newer networks.
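The lead-based weighting from the list above could look roughly like this in Python. The function names and the `scale` parameter are hypothetical; `scale=10` just mirrors the 10 in the formula, and using the point lead from before the move is an assumption.

```python
def lead_weight(point_lead, scale=10.0):
    """scale / (|point_lead| + scale): 1.0 in an even game, shrinking as the lead grows."""
    return scale / (abs(point_lead) + scale)


def weighted_loss(delta_score, point_lead, scale=10.0):
    """Scale a score change by how close the game was (assumed: lead before the move)."""
    return delta_score * lead_weight(point_lead, scale)


for lead in (0.0, 10.0, 30.0):
    print(lead, lead_weight(lead))
```

A larger `scale` makes the weighting flatter, so big leads are discounted less; a smaller one makes it behave more like winrate.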

There’s no way I’m getting to this soon myself, but there’s no reason I’d be better at playtesting any particular formula than anyone else, so anyone else: go try stuff out. If you find something really good, I can make it an official metric in KataGo in a future release, or it can at least inform what metrics are added in the future when I do work on this stuff more.


I’ve not actually looked at the top three moves since becoming a supporter because I know they’re not much use in general. Now to be fair the first game I looked at it did pick the big score losses. However, for example this game

The moves it picks are 1-2 point losses, when the Black group actually just up and dies in the bottom left. I think that’s probably a key move, over those 1-2 point mistakes. More specifically move 71 is -9 points, move 73 is -3, there’s 4 and 7 point mistakes to follow, and then a -11 on move 79.

Now I can understand that it might not be completely useful for nonsupporters if it’s showing too many of the key moves, which are close together. Like maybe it’d be not so useful if move X was a big mistake and then move X+1 was equally big, and basically they just end up showing the same variation.

Maybe one could do something like this, or guarantee a Y move gap between mistakes of the same color. I’m hesitant about that though, since that might not be what people would want out of a game.
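A minimal Python sketch of the gap idea, in case it helps to play with it: take moves in order of points lost, but skip any move that is within `min_gap` moves of an already chosen move by the same color. The function name, the tuple layout and the numbers are all made up for illustration.

```python
def pick_key_moves(losses, k=3, min_gap=6):
    """losses: list of (move_number, color, points_lost). Returns up to k spaced-out picks."""
    ranked = sorted(losses, key=lambda m: m[2], reverse=True)
    chosen = []
    for move_number, color, lost in ranked:
        too_close = any(
            c == color and abs(n - move_number) < min_gap for n, c, _ in chosen
        )
        if not too_close:
            chosen.append((move_number, color, lost))
        if len(chosen) == k:
            break
    return sorted(chosen)  # back in game order


# Illustrative numbers only: a cluster of Black mistakes plus one White mistake.
example = [
    (71, "B", 9.0), (73, "B", 3.0), (75, "B", 4.0),
    (77, "B", 7.0), (79, "B", 11.0), (120, "W", 6.0),
]
print(pick_key_moves(example, k=3, min_gap=6))
```

The trade-off is the one mentioned above: a gap avoids showing the same variation twice, but it can also hide a genuinely separate mistake that happens to be nearby.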

How many games do you want linked by the way @benjito ? As many as I can find where the top 3 moves maybe don’t match the biggest mistakes or?

On a separate but unrelated note, what would you think about adding a table/modal/popup type thing that one could click where it summarises the AI graph? Something like lichess and have?

It’s not necessarily the most useful, depending on what kind of categories we decide to group things into but it would certainly give people a nice overview to compare game to game.

Something like groupings according to katago – Excellent (bot moves with small score loss), Great (bot moves with larger score loss), Inaccuracy (small score loss <1,2 points?), Mistake (3-5 or 6 points?), Blunder (7+ point mistake).

I guess we don’t necessarily have a centipawn loss, but we could give an average point loss per move sort of thing, or something like that.
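A rough Python sketch of what such a summary could compute. The cut-offs are placeholders picked from the ranges floated above and would need tuning, the "Fine" label for everything under 1 point is invented here, and the Excellent/Great split is left out because it would also need to know whether the move matched a bot suggestion.

```python
from collections import Counter

# Placeholder thresholds (minimum points lost, label), checked from largest to smallest.
THRESHOLDS = [(7.0, "Blunder"), (3.0, "Mistake"), (1.0, "Inaccuracy")]


def classify(points_lost):
    """Map a single move's point loss to a category label."""
    for minimum, label in THRESHOLDS:
        if points_lost >= minimum:
            return label
    return "Fine"  # hypothetical label for losses under 1 point


def summarise(losses):
    """Return (category counts, average point loss per move) for one player's moves."""
    counts = Counter(classify(x) for x in losses)
    average = sum(losses) / len(losses) if losses else 0.0
    return counts, average


counts, average = summarise([0.2, 1.5, 4.0, 9.0, 11.0, 0.0])
print(dict(counts), round(average, 2))
```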


That would be great indeed, I’m all for it.

A lot of potential Go players can be found among Chess players, and those will be used to this kind of table.

But if it requires too much work I’m already happy with fixing the 3 moves thing, and unlocking the other key moves.


Honestly as many as you care to share! It helps me to verify the new algorithm isn’t wonky. That game alone just helped me find a subtle issue where scores weren’t being synced in the same way win rates were :eyes: Key moves look about right, now just have to figure out the red dots…



Yeah I like the chess stuff too! I can’t commit to adding any UI components at the moment, but if that’s a project you were interested in I could show you around the AI review code/data structures :slight_smile:


Sure, I’ll do a pass through a few games

Ladder Challenge: shinuito(#32) vs Bhsd(#22) - I think the current one gets a few key points, and maybe it could be argued in a sense they were the mistakes that threw the game (which is what winrate does) but for mere humans, continuing to make 8 point mistakes are nails in the coffin :slight_smile:

Ladder Challenge: Aftiz(#38) vs shinuito(#27) - this one’s interesting because the bottom right needs to be resolved and isn’t for a long time, so I would imagine every mistake is slightly bigger than if the same moves were played with the right side settled. So I wonder whether the key move variations will end up the same.

Tournament Game: Secondish Moderators Round Robin (80881) R:1 (shinuito vs Kosh) - a handicap one, where if it’s going by winrate it’s probably just picking random moves.

Tournament Game: Secondish Moderators Round Robin (80881) R:1 (gennan vs shinuito) - handicap, picking k2 as a top mistake when there’s many other things that could’ve been picked.

shinuito vs. 20bTurboLz-Elf-v1 - game vs elf bot where it refuses to kill my groups and is happy to win by 0.5 points. The moves selected by winrate are tiny mistakes by comparison to the massive score loss ones.

Summer Cup Round 5 - Similar, where the key moves revolve around not playing the bottom side, just curious if the odd game like that is useful to test.

yuuurt2 - 5 stone 13x13 for fun. Naturally riddled with high score mistakes. (I can link more high handicap 13x13 games).


Thanks a thousand times!

I’ll take a look at these today!

PR coming soon (EDIT: #1553), just some notes on how those games turned out.

Ladder Challenge: shinuito(#32) vs Bhsd(#22) : The 8 point “nails in the coffin” are now highlighted.

Ladder Challenge: Aftiz(#38) vs shinuito(#27) - Interestingly, the bottom right doesn’t get called out in the top three as there’s a late game center move that is worth 10 points. Still, I think that’s WAI. There’s probably a clever way to deduplicate consecutive swings like 133-135, but that’s a project for another day.

Tournament Game: Secondish Moderators Round Robin (80881) R:1 (shinuito vs Kosh) - I honestly don’t think this gets much better using score, since O13 was a big move for so long, but the original choices were pretty random as you said.

Tournament Game: Secondish Moderators Round Robin (80881) R:1 (gennan vs shinuito) - Much better with the score metric IMO.

shinuito vs. 20bTurboLz-Elf-v1 - Score metric catches the really big move (K14) that was not caught with win rate. Again, a way to deduplicate similar key moves would be nice…

Summer Cup Round 5 - Yep this is useful to test. I’m a little bummed that a move like 42 is no longer considered a top key move, but I think it’s good that the 10 point swings are picked up now.

yuuurt2 - This game had pretty much one move that needed to be played, and both reviews picked it up. I suppose the score review wins by a hair since it picks the times when that move is most “valuable”.

Overall this gives me some confidence in the update, so thanks again for providing games and commentary @shinuito!


No problem!

Yeah I wasn’t sure whether anything I pointed out would’ve been the “real” key moves. Naturally it would’ve been better for someone stronger to pick key points from my games.

But even so, they probably still can, and see how it differs from the new update :stuck_out_tongue:

It’ll be interesting to see if many people who use the top three moves have an opinion, or find it better after the update.

Great work in any case :slight_smile:


My opinion is that these top 3 moves were a funny idea at first but completely useless after all.

Anyone who can read a graph at a primary school level should agree on this.

The graph is not accurate for free users. It is only a very quick, rough pass showing KG’s first impressions without any analysis.

The top 3 moves might have less meaning for supporters since they get a somewhat deep analysis of the whole game, but even then the extra strength that gets applied to the top 3 moves may still be insightful.

It is not merely a listing of the data already available in the graph, the top 3 moves always get analysed at the strongest level available (currently KG level IV) for all users.


I was not aware of these differences in analysis levels for the top moves.
I still find it much more interesting to follow the variations on the graph, paying some attention to each move and proposed variation, than to look at an extract of the top 3.

Is this the right thread to +1 the point that @gennan keeps banging on about, that the graph from AI review should default to score rather than winrate? Here’s a reddit thread as an example of the confusion that comes from showing winrate rather than score by default.


It probably should, but if you’re logged in it remembers your last setting, so in theory most people only need to change it once. But yeah score as default would make more sense… Surprised it isn’t already actually :thinking:


If either of you is interested in making a PR :slight_smile:


Yeah, useful for me and you, but the people who would benefit most from score by default are those who don’t know about its existence :slight_smile: