Using the 5 games between AlphaGo and Lee Sedol
About 2 years ago a student asked me for ideas for a thesis project and I proposed exactly this. (He was more interested in getting rich quickly with AI stock market predictions at the time, though, and in the end did neither of these things.)
Nice to see someone has done it now. The idea must be pretty obvious.
Nicely done! It returns the results fairly quickly, so I assume it's just using policy values, network values or similar, not doing any playouts? Can you share any more details of the method, or do you need to finish the thesis first?
I do not know if this test will help in your calibrations, but I put in my last 9 ranked games and got this: Player rating: 1796.68 (1.5-kyu) which is, more or less, my OGS ranking.
Good luck and good results with the thesis presentation and defense
From what I can gather by cutting each record off at just 1, 2, 3, 10, 20, 30, or 50 moves, and by comparing various numbers of records against individual records: after each move is added, it is given a strength evaluation value (I suspect just an additional value head using the final fully connected layer), and then some kind of moving average or weighted average pulls the strength toward a final value. (How the different records are combined is not a simple average, though, and seems to be biased toward the higher-ranked records; see the toy sketch below.)
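Purely to illustrate the kind of aggregation I am guessing at (this is not the actual method, just a hypothetical sketch), a softmax-weighted average biases the pooled value toward the higher per-move scores:

```python
import numpy as np

def biased_average(per_move_scores: np.ndarray, temperature: float = 1.0) -> float:
    """Softmax-weighted average: higher per-move scores receive more weight,
    so the pooled estimate leans toward the stronger-looking moves."""
    weights = np.exp(per_move_scores / temperature)
    weights /= weights.sum()
    return float(np.dot(weights, per_move_scores))

scores = np.array([0.1, 0.2, 0.9])               # toy per-move strength values
print(np.mean(scores), biased_average(scores))   # plain mean ~0.40 vs biased ~0.54
```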
Thank you all for trying out my project.
To address some of the results that seem very off, I should elaborate on the strength model being "as accurate as Glicko-2". More precisely, if you take two players, Black and White, with estimated ratings b and w, and you define Black's winning chances as 1 / (1 + exp(w - b)) (the "usual" logistic function model), then the strength model maximizes the predicted likelihood of the actual match outcome about as well as Glicko-2 does, maybe just a tad worse. It outperforms for new players with 4 games or fewer, because the moves give much more information than just win/loss.
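For concreteness, here is that logistic outcome model as a tiny sketch (plain Python; the function name is mine):

```python
import math

def black_win_prob(b: float, w: float) -> float:
    """Predicted chance that Black wins, given rating estimates b (Black) and w (White)."""
    return 1.0 / (1.0 + math.exp(w - b))

print(black_win_prob(0.0, 0.0))  # equal ratings -> 0.5
print(black_win_prob(1.0, 0.0))  # Black one unit stronger -> ~0.731
```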
For training, all the rating labels (Glicko-2 based) are normalized to mean 0 and stdev 1. On this scale, the MSE is 0.861. On the label scale, this corresponds to an error of about sqrt(0.861)*315.8 ≈ 293
points, quite a swing! For context, 1-kyu = 1837.38, 1-dan = 1918.49, a difference of 81 points. However, these rating labels are just an auxiliary training target, while the real performance measure is the match outcome log-likelihood.
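Spelled out (assuming 315.8 is the stdev used to normalize the rating labels), the back-conversion is just:

```python
import math

label_stdev = 315.8      # presumed stdev of the Glicko-2 rating labels, in rating points
mse_normalized = 0.861   # MSE on the normalized (mean 0, stdev 1) scale

rmse_points = math.sqrt(mse_normalized) * label_stdev
print(round(rmse_points, 1))  # ~293.0 rating points
```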
In this figure you can see a representative sample of 10k labels (Glicko-2 ratings) and how the strength model evaluates them, based on up to 500 recent moves:
One visible tendency is that the strength model does not like the extremes. This is because Lee Sedol level players are very rare on OGS (where the data originated), and maybe also because, by betting so high or low, the model risks being wrong by a lot if the stronger player doesn't win in the end.
For the website, the model outputs were matched to the true OGS scale by sampling 200 historical OGS ratings from 100 random games and fitting a linear function f(x) = ax + b of the model outputs x to minimize the MSE. It came out to a=334.03 and b=1595.1, which is close to the label scale.
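That calibration step amounts to a plain least-squares fit; a minimal sketch with placeholder data (the real fit used the 200 sampled OGS ratings):

```python
import numpy as np

# Placeholder data: strength-model outputs x and the corresponding OGS ratings.
model_outputs = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])
ogs_ratings = np.array([1200.0, 1500.0, 1630.0, 1860.0, 2100.0])

# Fit f(x) = a*x + b by minimizing the mean squared error.
A = np.column_stack([model_outputs, np.ones_like(model_outputs)])
(a, b), *_ = np.linalg.lstsq(A, ogs_ratings, rcond=None)
print(a, b)  # on the real data this came out to about a=334.03, b=1595.1
```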
Sure!
First, for every move to evaluate, the board position goes through the KataGo network. From its trunk output, i.e. the internal representation of the network's knowledge, the 384-value vector at the move location is picked out. The strength model itself is much smaller and has no Go knowledge; it just works on these KataGo trunk outputs.
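As a rough illustration (not literally the thesis code; the array layout is an assumption here), once the trunk output for a position is available as an array of shape (384, 19, 19), picking out the vector at the move location is just an index lookup:

```python
import numpy as np

def move_feature(trunk: np.ndarray, row: int, col: int) -> np.ndarray:
    """Pick the 384-value feature vector at the played move's board location.

    trunk: KataGo trunk output for the position, assumed shape (384, 19, 19).
    """
    assert trunk.shape == (384, 19, 19)
    return trunk[:, row, col]  # shape (384,)

# Placeholder; in practice this comes from running the KataGo network on the position.
trunk = np.zeros((384, 19, 19), dtype=np.float32)
print(move_feature(trunk, row=3, col=15).shape)  # (384,)
```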
The strength model uses the Set Transformer architecture. Its purpose is to pay attention to the moves which are informative about the strength of the player, and to disregard obligatory atari responses etc.; there is a little bit more to it than a moving average.
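A rough sketch of the pooling-by-attention idea (PyTorch; layer sizes and names are illustrative, not the actual thesis architecture):

```python
import torch
import torch.nn as nn

class MovePooling(nn.Module):
    """Attention pooling over a set of per-move embeddings, Set Transformer style.

    A learned query attends over the moves, so informative moves can receive
    more weight than, say, forced atari responses."""

    def __init__(self, d_model: int = 384, n_heads: int = 4):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # scalar strength estimate

    def forward(self, moves: torch.Tensor) -> torch.Tensor:
        # moves: (batch, n_moves, d_model), one row per evaluated move
        query = self.seed.expand(moves.size(0), -1, -1)
        pooled, _ = self.attn(query, moves, moves)        # (batch, 1, d_model)
        return self.head(pooled.squeeze(1)).squeeze(-1)   # (batch,)

model = MovePooling()
print(model(torch.randn(2, 500, 384)).shape)  # torch.Size([2])
```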
Your training data is suffering from imbalanced-data issues for very strong and very weak players, and from regression to the mean.
The 384-element vectors might not contain the "information" to distinguish very weak from merely weak players' moves in the first place (the bad and very bad moves are all out of scope, and the distinctions might be minute, while the highly sensitive distinctions are reserved for those candidate blocks/shapes patterns which most strong players would pick out regardless), and I suspect it is not because the model is risk averse. At the very top level, it might be measuring how closely a player's moves correlate with "the KataGo style" rather than how they correlate with strength. I've tested Golaxy and Fine Art games, and they were given a lower rating (6.2-dan) than Shin Jinseo, not because they are weaker, but because they focus differently on key shapes/patterns. And KataGo with one playout, which is weaker, can get over 6d, even 6.5-dan, due to its closer "style", while a stronger AI like 15bTurboLeela only gets 2.5-dan with a very different style (I even excluded the predictable ladder-fail games, but it still ranked much lower).
I am most curious as to why this failed-ladder game gives 15bTurboLeela a 1.2-dan rating; is running a failed ladder "good rating and winning moves"?
Your comments are certainly on-point and reasonable.
It was trained on a representative sample of OGS games and that's what it covers. The model doesn't have a defined goal besides performing well, in comparison with Glicko-2, on that same data.
Among the myriad things that I could have done, but did not do, within the scope of my thesis is sampling the training data evenly from all ranks instead of using the representative OGS distribution. And if I had included the KataGo weights in the training process, instead of treating that part as static precomputation, it could have adapted to the purpose of strength estimation even for weaker players. I don't have any means to explain particular outputs, like the failed ladder.
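For anyone curious, sampling evenly across ranks would amount to something like this (hypothetical sketch, not part of the thesis):

```python
import random
from collections import defaultdict

def sample_evenly_by_rank(games, per_rank):
    """Draw up to `per_rank` games from each rank bucket instead of following
    the natural OGS rank distribution. `games` is a list of (rank, game) pairs."""
    buckets = defaultdict(list)
    for rank, game in games:
        buckets[rank].append(game)
    sample = []
    for bucket in buckets.values():
        sample.extend(random.sample(bucket, min(per_rank, len(bucket))))
    return sample
```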
Not doing these things comes down to just picking my battles and getting it done, which took me long enough anyway. I tried training with the KataGo weights, but since one match outcome estimate includes 2*500 KataGo outputs, doing it with the cookie-cutter approach is computationally infeasible.
If it were my PhD and not my Master's, maybe I could have investigated all these scenarios.
Very cool!
Would it be possible to make this suck our games directly from OGS, like Got Stats? does?
Also: now I have an ear worm again, thank you!
I ran some tests. By cutting off the other corners' joseki exchanges, the predicted ranking dropped to 0.9 kyu, and by cutting off even more unnecessary exchanges to form a ladder, the ranking dropped to 2.7 kyu (ladder starting at move 12), 6.0 kyu (ladder starting at move 10), and 3.0 kyu (ladder starting at move 8, the minimal crosscut ladder shape). And when I cut off all the moves before the ladder started, the record where the ladder starts at move 12, containing only those 12 moves, was reported as 2.5 kyu, the 10-move running-ladder one dropped to 11.8 kyu, and the 8-move one dropped to 5.3 kyu. Most interestingly, the 0.9-kyu record cut at move 18 actually got boosted to 1.6 dan. It seems running a failed ladder only reduces the rank when the starting rank is low, and has barely any effect, or even gives a boost, if the rank is high enough.
Also strangely, adding handicap stones seems to make the model think a player becomes stronger. (I've observed this before when I uploaded my IGS games for testing: most of them are handicap games, and the players getting high handicaps (4 stones and above) tend to get higher rank scores, some even surpassing my estimated rank, although they lost those high-handicap games against me.)
I used the new KataGo human-like supervised-trained network, set to rank 20k with just 1 playout, against the 15k human-like bot from ai-sensei. The KataGo human-like networks are surprisingly strong (they fight way better and are good at openings), and both bots seem to be over-evaluated, especially the KataGo human-like one, which was evaluated at near dan-level strength.
I don't think AIs can be evaluated accurately. The 15k bot makes 5k moves as well as 25k moves, which is confusing.
I guess you need more playouts, not fewer. If the mission is to play like a 20k and the neural net itself is too strong, I think playouts are used to become more like 20k.
More playouts simply make it much stronger, maybe even actual dan-level strong; see lightvector's document about the human SL model. Effectively, the policy output is already the move probability distribution for the expected rank, and you only need to treat it like a PDF and roll the dice to sample a move. (And from my personal judgement, it can probably play weak-SDK-level games with no trouble, even using just 1 playout.)
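That "treat it like a PDF" step is just proportional sampling (sketch):

```python
import numpy as np

def sample_move(policy: np.ndarray) -> int:
    """Sample a move index proportionally to the policy probabilities
    (full temperature), instead of always taking the argmax."""
    policy = policy / policy.sum()  # make sure it is a proper distribution
    return int(np.random.choice(len(policy), p=policy))

policy = np.array([0.30, 0.25, 0.25, 0.20])  # toy policy over 4 candidate moves
print(sample_move(policy))
```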
Note that if you are searching with many visits (or even just a few visits!), typically you can expect that KataGo will NOT match the strength of a player of the given humanSLProfile, but will still be stronger because the search will probably solve a lot of tactics that players of a weaker rank would not solve.
- The human SL model is trained such that using only one visit, and full temperature (i.e. choosing random moves from the policy proportionally often, rather than always choosing the top move), will give the closest match to how players of the given rank might play. This should be true up to mid-high dan level, at which point the raw model might start to fall short and need more than 1 visit to keep up in strength.
Strange design. An easier interface would be to let the user just choose a kyu strength and have the number of playouts chosen automatically, so the user is not forced to do any experiments. The developer should make sure that the 20k parameter always results in 20k strength and the 9d parameter always results in 9d strength.
I wonder if you might have misconfigured things. I took a quick skim over the 20k side of that game, and the frequency with which it chooses the exact top move of the 20k human prediction is suspiciously high - the top move is almost always the choice in the later part of the game, even when the top move is not even the majority of the policy mass.
If it always chooses the top move, then it would of course be much stronger than 20k because it will be very unlikely to play as many 20k level mistakes due to the way that there tends to be more ways to make mistakes than to make good moves. For example, in a position where 20k players would be 70% to blunder and only 30% to make the correct move, and the policy correctly predicts this, if there are 3 ways to blunder each with 70/3 = ~23% policy while there is only one correct move that has the 30% policy, then if you always choose the top move you will never blunder whereas if you sample proportionally you correctly reproduce the 70% chance to blunder.
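The 70/30 example is easy to check numerically (toy simulation, not actual KataGo behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
# One correct move at 30% policy, three blunders at ~23.3% each.
policy = np.array([0.30, 0.70 / 3, 0.70 / 3, 0.70 / 3])
policy = policy / policy.sum()

# Always taking the top move never blunders: move 0 is always chosen.
print(int(np.argmax(policy)))  # 0

# Sampling proportionally reproduces the intended ~70% blunder rate.
samples = rng.choice(len(policy), size=100_000, p=policy)
print((samples != 0).mean())  # ~0.70
```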
Have you followed the configuration instructions appropriately for whatever GTP engine youâre using? In particular, passing in both -model and -human-model, and using this config including its particular temperature and other settings except adjusted to 20k? These settings are designed to make sure the sampling happens almost-proportionally. KataGo/cpp/configs/gtp_human5k_example.cfg at master · lightvector/KataGo · GitHub
KataGo here is more intending to provide a raw interface for downstream developers of tools, research and analysis to leverage how they wish, so for now, it's easier to provide an interface that directly controls the low-level things that are happening (e.g. is it using the raw net alone, or is it not?) rather than more complex combinations of them that have been tuned to try to achieve some final outcome (like playing at a particular final strength level).
For what it's worth, if you are sampling at near-full temperature from the raw policy of the human model (easiest way is to use gtp_human5k_example.cfg and change nothing in the config except the rank), then KataGo should not be more than a few stones off in strength all the way up to mid-high-dan level, where it then starts to not keep up. (No surprise - no raw policy net sampling at full temperature has ever reached top human levels; this is just not possible with the current size of models and current training methods.) And a few stones of difference should already be within the margin of error between what different ranks mean across different servers or Go associations. So I hope all "kyu" ranks should already be possible to achieve with only one parameter adjustment, even with the desire to expose parameters in dev-oriented raw form.
@Animiral - it sounds like a cool algorithmic approach. I would be curious if the prediction might be improvable by also augmenting it with the embedding vector from the human SL net configured to some weaker ranks, or whether it turns out that doesnât help at all. For example, the human SL net, when configured to weaker ranks, might be paying more attention to discriminate between move features that matter for predicting how weaker human players play. But it could also turn out it doesnât matter, such as if the limiting factor is on data quality, or the representation of the stronger net does contain all the information needed anyways, or there are just fundamental limits in terms of the noise level and variance in the moves in games.
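If someone wanted to try that, the simplest version is probably just concatenating the two per-move vectors before the strength model sees them (hypothetical sketch; the 384-value size per net is assumed from the thread above):

```python
import numpy as np

def combined_move_feature(strong_vec: np.ndarray, human_sl_vec: np.ndarray) -> np.ndarray:
    """Concatenate the per-move embedding from the strong KataGo net with the one
    from the human SL net configured to a weaker rank."""
    return np.concatenate([strong_vec, human_sl_vec])  # e.g. 384 + 384 = 768 values

print(combined_move_feature(np.zeros(384), np.zeros(384)).shape)  # (768,)
```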
Let me check. I might have messed something up in my code, since I don't think ai-sensei has an existing API, so I have to run a script and pause, and copy-paste the moves between them.
Although, I've visualized the policy moves when they were shown, and from what I can remember, even the remaining "30%" moves are all pretty good moves (good shape, not "blunders"). And in semeai fighting, the policy on the atari move is almost always close to 95% or even higher, so I doubt it would even have a chance to blunder.
(I'll do another test and run it on another UI and see if I fixed it.)