Using the 5 games between AlphaGo and Lee Sedol
About 2 years ago a student asked me for ideas for a thesis project and I proposed exactly this. (He was more interested in getting rich quickly with AI stock market predictions at the time, though, and in the end did neither of these things.)
Nice to see someone has done it now. The idea must be pretty obvious.
Nicely done! It returns the results fairly quickly, so I assume it's just using policy values, network values or similar, not doing any playouts? Can you share any more details of the method, or do you need to finish the thesis first?
I do not know if this test will help in your calibrations, but I put in my last 9 ranked games and got this: Player rating: 1796.68 (1.5-kyu) which is, more or less, my OGS ranking.
Good luck and good results with the thesis presentation and defense
From what I can gather by cutting each record off at just 1, 2, 3, 10, 20, 30, or 50 moves, and by comparing various numbers of records against individual records: after each move is added, it is given a strength evaluation value (I suspect just an additional value head using the final fully connected layer), and then some kind of moving average or weighted average pulls the strength toward a final value. (How the different records are combined is not a simple average, though, and seems to be biased toward the higher-ranked records; see the toy sketch below.)
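Purely to illustrate the kind of aggregation I am guessing at (this is not the actual method, just a hypothetical sketch), a softmax-weighted average biases the pooled value toward the higher per-move scores:

```python
import numpy as np

def biased_average(per_move_scores: np.ndarray, temperature: float = 1.0) -> float:
    """Softmax-weighted average: higher per-move scores receive more weight,
    so the pooled estimate leans toward the stronger-looking moves."""
    weights = np.exp(per_move_scores / temperature)
    weights /= weights.sum()
    return float(np.dot(weights, per_move_scores))

scores = np.array([0.1, 0.2, 0.9])               # toy per-move strength values
print(np.mean(scores), biased_average(scores))   # plain mean ~0.40 vs biased ~0.54
```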
Thank you all for trying out my project.
To address some of the results that seem very off, I should elaborate on the strength model being "as accurate as Glicko-2". More precisely, if you take two players, Black and White, with estimated ratings b and w, and you define Black's winning chances as 1 / (1 + exp(w - b)) (the "usual" logistic function model), then the strength model maximizes the predicted likelihood of the actual match outcome about as well as Glicko-2 does, maybe just a tad worse. It outperforms for new players with 4 games or fewer, because the moves give much more information than just win/loss.
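For concreteness, here is that logistic outcome model as a tiny sketch (plain Python; the function name is mine):

```python
import math

def black_win_prob(b: float, w: float) -> float:
    """Predicted chance that Black wins, given rating estimates b (Black) and w (White)."""
    return 1.0 / (1.0 + math.exp(w - b))

print(black_win_prob(0.0, 0.0))  # equal ratings -> 0.5
print(black_win_prob(1.0, 0.0))  # Black one unit stronger -> ~0.731
```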
For training, all the rating labels (Glicko-2 based) are normalized to mean 0 and stdev 1. On this scale, the MSE is 0.861. On the label scale, this corresponds to an error of about sqrt(0.861)*315.8 ≈ 293
points, quite a swing! For context, 1-kyu = 1837.38, 1-dan = 1918.49, a difference of 81 points. However, these rating labels are just an auxiliary training target, while the real performance measure is the match outcome log-likelihood.
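Spelled out (assuming 315.8 is the stdev used to normalize the rating labels), the back-conversion is just:

```python
import math

label_stdev = 315.8      # presumed stdev of the Glicko-2 rating labels, in rating points
mse_normalized = 0.861   # MSE on the normalized (mean 0, stdev 1) scale

rmse_points = math.sqrt(mse_normalized) * label_stdev
print(round(rmse_points, 1))  # ~293.0 rating points
```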
In this figure you can see a representative sample of 10k labels (Glicko-2 ratings) and how the strength model evaluates them, based on up to 500 recent moves:
One visible tendency is that the strength model does not like the extremes. This is because Lee Sedol level players are very rare on OGS (where the data originated), and maybe also because, by betting so high or low, the model risks being wrong by a lot if the stronger player doesn't win in the end.
For the website, the model outputs were matched to the true OGS scale by sampling 200 historical OGS ratings from 100 random games and fitting a linear function f(x) = ax + b of the model outputs x to minimize the MSE. It came out to a=334.03 and b=1595.1, which is close to the label scale.
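That calibration step amounts to a plain least-squares fit; a minimal sketch with placeholder data (the real fit used the 200 sampled OGS ratings):

```python
import numpy as np

# Placeholder data: strength-model outputs x and the corresponding OGS ratings.
model_outputs = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])
ogs_ratings = np.array([1200.0, 1500.0, 1630.0, 1860.0, 2100.0])

# Fit f(x) = a*x + b by minimizing the mean squared error.
A = np.column_stack([model_outputs, np.ones_like(model_outputs)])
(a, b), *_ = np.linalg.lstsq(A, ogs_ratings, rcond=None)
print(a, b)  # on the real data this came out to about a=334.03, b=1595.1
```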
Sure!
First, for every move to evaluate, the board position goes through the KataGo network. From its trunk output, i.e. the internal representation of the network's knowledge, the 384-value vector at the move location is picked out. The strength model itself is much smaller and has no Go knowledge; it just works on these KataGo trunk outputs.
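As a rough illustration (not literally the thesis code; the array layout is an assumption here), once the trunk output for a position is available as an array of shape (384, 19, 19), picking out the vector at the move location is just an index lookup:

```python
import numpy as np

def move_feature(trunk: np.ndarray, row: int, col: int) -> np.ndarray:
    """Pick the 384-value feature vector at the played move's board location.

    trunk: KataGo trunk output for the position, assumed shape (384, 19, 19).
    """
    assert trunk.shape == (384, 19, 19)
    return trunk[:, row, col]  # shape (384,)

# Placeholder; in practice this comes from running the KataGo network on the position.
trunk = np.zeros((384, 19, 19), dtype=np.float32)
print(move_feature(trunk, row=3, col=15).shape)  # (384,)
```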
The strength model uses the Set Transformer architecture. Its purpose is to pay attention to the moves which are informative about the strength of the player, and to disregard obligatory atari responses etc.; there is a little bit more to it than a moving average.
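A rough sketch of the pooling-by-attention idea (PyTorch; layer sizes and names are illustrative, not the actual thesis architecture):

```python
import torch
import torch.nn as nn

class MovePooling(nn.Module):
    """Attention pooling over a set of per-move embeddings, Set Transformer style.

    A learned query attends over the moves, so informative moves can receive
    more weight than, say, forced atari responses."""

    def __init__(self, d_model: int = 384, n_heads: int = 4):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # scalar strength estimate

    def forward(self, moves: torch.Tensor) -> torch.Tensor:
        # moves: (batch, n_moves, d_model), one row per evaluated move
        query = self.seed.expand(moves.size(0), -1, -1)
        pooled, _ = self.attn(query, moves, moves)        # (batch, 1, d_model)
        return self.head(pooled.squeeze(1)).squeeze(-1)   # (batch,)

model = MovePooling()
print(model(torch.randn(2, 500, 384)).shape)  # torch.Size([2])
```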
Your training data is suffering from imbalanced-data issues for very strong and very weak players, and from regression to the mean.
The 384-element vectors might not contain the "information" to distinguish very weak from merely weak players' moves in the first place (the bad and very bad moves are all out of scope, and the distinctions might be minute, while the highly sensitive distinctions are reserved for those candidate blocks/shapes patterns which most strong players would pick out regardless), and I suspect it is not because the model is risk averse. At the very top level, it might be measuring how closely a player's moves correlate with "the KataGo style" rather than how they correlate with strength. I've tested Golaxy and Fine Art games, and they were given a lower rating (6.2-dan) than Shin Jinseo, not because they are weaker, but because they focus differently on key shapes/patterns. And KataGo with one playout, which is weaker, can get over 6d, even 6.5-dan, due to its closer "style", while a stronger AI like 15bTurboLeela only gets 2.5-dan with a very different style (I even excluded the predictable ladder-fail games, but it still ranked much lower).
I am most curious as to why this failed-ladder game gives 15bTurboLeela a 1.2-dan rating; is running a failed ladder "good rating and winning moves"?
Your comments are certainly on-point and reasonable.
It was trained on a representative sample of OGS games and that's what it covers. The model doesn't have a defined goal besides performing well, in comparison with Glicko-2, on that same data.
Among the myriad things that I could have done, but did not do, within the scope of my thesis is sampling the training data evenly from all ranks instead of using the representative OGS distribution. And if I had included the KataGo weights in the training process, instead of treating that part as static precomputation, it could have adapted to the purpose of strength estimation even for weaker players. I don't have any means to explain particular outputs, like the failed ladder.
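For anyone curious, sampling evenly across ranks would amount to something like this (hypothetical sketch, not part of the thesis):

```python
import random
from collections import defaultdict

def sample_evenly_by_rank(games, per_rank):
    """Draw up to `per_rank` games from each rank bucket instead of following
    the natural OGS rank distribution. `games` is a list of (rank, game) pairs."""
    buckets = defaultdict(list)
    for rank, game in games:
        buckets[rank].append(game)
    sample = []
    for bucket in buckets.values():
        sample.extend(random.sample(bucket, min(per_rank, len(bucket))))
    return sample
```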
Not doing these things comes down to just picking my battles and getting it done, which took me long enough anyway. I tried training with the KataGo weights, but since one match outcome estimate includes 2*500 KataGo outputs, doing it with the cookie-cutter approach is computationally infeasible.
If it were my PhD and not my Master's, maybe I could have investigated all these scenarios.
Very cool!
Would it be possible to make this suck our games directly from OGS, like Got Stats? does?
Also: now I have an ear worm again, thank you!
I ran some tests. By cutting off the other corners' joseki exchanges, the predicted ranking dropped to 0.9 kyu, and by cutting off even more unnecessary exchanges to form a ladder, the ranking dropped to 2.7 kyu (ladder starting at move 12), 6.0 kyu (ladder starting at move 10), and 3.0 kyu (ladder starting at move 8, the minimal crosscut ladder shape). And when I cut off all the moves before the ladder started, the record where the ladder starts at move 12, containing only those 12 moves, was reported as 2.5 kyu, the 10-move running-ladder one dropped to 11.8 kyu, and the 8-move one dropped to 5.3 kyu. Most interestingly, the 0.9-kyu record cut at move 18 actually got boosted to 1.6 dan. It seems running a failed ladder only reduces the rank when the starting rank is low, and has barely any effect, or even gives a boost, if the rank is high enough.
Also strangely, adding handicap stones seems to make the model think a player becomes stronger. (I've observed this before when I uploaded my IGS games for testing: most of them are handicap games, and the players getting high handicaps (4 stones and above) tend to get higher rank scores, some even surpassing my estimated rank, although they lost those high-handicap games against me.)
I used the new KataGo human-like supervised-trained network, set to rank 20k with just 1 playout, against the 15k human-like bot from ai-sensei. The KataGo human-like networks are surprisingly strong (they fight way better and are good at openings), and both bots seem to be over-evaluated, especially the KataGo human-like one, which was evaluated at near dan-level strength.
I don't think AIs can be evaluated accurately. The 15k bot makes 5k moves as well as 25k moves, which is confusing.
I guess you need more playouts, not fewer. If the mission is to play like a 20k and the neural net itself is too strong, I think playouts are used to become more like 20k.
More playouts simply make it much stronger, maybe even actual dan-level strong; see lightvector's document about the human SL model. Effectively, the policy output is already the move probability distribution for the expected rank, and you only need to treat it like a PDF and roll the dice to sample a move. (And from my personal judgement, it can probably play weak-SDK-level games with no trouble, even using just 1 playout.)
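That "treat it like a PDF" step is just proportional sampling (sketch):

```python
import numpy as np

def sample_move(policy: np.ndarray) -> int:
    """Sample a move index proportionally to the policy probabilities
    (full temperature), instead of always taking the argmax."""
    policy = policy / policy.sum()  # make sure it is a proper distribution
    return int(np.random.choice(len(policy), p=policy))

policy = np.array([0.30, 0.25, 0.25, 0.20])  # toy policy over 4 candidate moves
print(sample_move(policy))
```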
Note that if you are searching with many visits (or even just a few visits!), typically you can expect that KataGo will NOT match the strength of a player of the given humanSLProfile, but will still be stronger because the search will probably solve a lot of tactics that players of a weaker rank would not solve.
- The human SL model is trained such that using only one visit, and full temperature (i.e. choosing random moves from the policy proportionally often, rather than always choosing the top move), will give the closest match to how players of the given rank might play. This should be true up to mid-high dan level, at which point the raw model might start to fall short and need more than 1 visit to keep up in strength.
Strange design. An easier interface would be to let the user just choose a kyu strength and have the number of playouts chosen automatically, so the user is not forced to do any experiments. The developer should make sure that the 20k parameter always results in 20k strength and the 9d parameter always results in 9d strength.
I wonder if you might have misconfigured things. I took a quick skim over the 20k side of that game, and the frequency with which it chooses the exact top move of the 20k human prediction is suspiciously high - the top move is almost always the choice in the later part of the game, even when the top move is not even the majority of the policy mass.
If it always chooses the top move, then it would of course be much stronger than 20k because it will be very unlikely to play as many 20k level mistakes due to the way that there tends to be more ways to make mistakes than to make good moves. For example, in a position where 20k players would be 70% to blunder and only 30% to make the correct move, and the policy correctly predicts this, if there are 3 ways to blunder each with 70/3 = ~23% policy while there is only one correct move that has the 30% policy, then if you always choose the top move you will never blunder whereas if you sample proportionally you correctly reproduce the 70% chance to blunder.
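The 70/30 example is easy to check numerically (toy simulation, not actual KataGo behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
# One correct move at 30% policy, three blunders at ~23.3% each.
policy = np.array([0.30, 0.70 / 3, 0.70 / 3, 0.70 / 3])
policy = policy / policy.sum()

# Always taking the top move never blunders: move 0 is always chosen.
print(int(np.argmax(policy)))  # 0

# Sampling proportionally reproduces the intended ~70% blunder rate.
samples = rng.choice(len(policy), size=100_000, p=policy)
print((samples != 0).mean())  # ~0.70
```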
Have you followed the configuration instructions appropriately for whatever GTP engine youâre using? In particular, passing in both -model and -human-model, and using this config including its particular temperature and other settings except adjusted to 20k? These settings are designed to make sure the sampling happens almost-proportionally. KataGo/cpp/configs/gtp_human5k_example.cfg at master · lightvector/KataGo · GitHub
KataGo here is more intending to provide a raw interface for downstream developers of tools, research and analysis to leverage how they wish, so for now, it's easier to provide an interface that directly controls the low-level things that are happening (e.g. is it using the raw net alone, or is it not?) rather than more complex combinations of them that have been tuned to try to achieve some final outcome (like playing at a particular final strength level).
For what it's worth, if you are sampling at near-full temperature from the raw policy of the human model (easiest way is to use gtp_human5k_example.cfg and change nothing in the config except the rank), then KataGo should not be more than a few stones off in strength all the way up to mid-high-dan level, where it then starts to not keep up. (No surprise - no raw policy net sampling at full temperature has ever reached top human levels; this is just not possible with the current size of models and current training methods.) And a few stones of difference should already be within the margin of error between what different ranks mean across different servers or Go associations. So I hope all "kyu" ranks should already be possible to achieve with only one parameter adjustment, even with the desire to expose parameters in dev-oriented raw form.
@Animiral - it sounds like a cool algorithmic approach. I would be curious if the prediction might be improvable by also augmenting it with the embedding vector from the human SL net configured to some weaker ranks, or whether it turns out that doesnât help at all. For example, the human SL net, when configured to weaker ranks, might be paying more attention to discriminate between move features that matter for predicting how weaker human players play. But it could also turn out it doesnât matter, such as if the limiting factor is on data quality, or the representation of the stronger net does contain all the information needed anyways, or there are just fundamental limits in terms of the noise level and variance in the moves in games.
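If someone wanted to try that, the simplest version is probably just concatenating the two per-move vectors before the strength model sees them (hypothetical sketch; the 384-value size per net is assumed from the thread above):

```python
import numpy as np

def combined_move_feature(strong_vec: np.ndarray, human_sl_vec: np.ndarray) -> np.ndarray:
    """Concatenate the per-move embedding from the strong KataGo net with the one
    from the human SL net configured to a weaker rank."""
    return np.concatenate([strong_vec, human_sl_vec])  # e.g. 384 + 384 = 768 values

print(combined_move_feature(np.zeros(384), np.zeros(384)).shape)  # (768,)
```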
Let me check. I might have messed something up in my code, since I don't think ai-sensei has an existing API, so I have to run a script and pause, and copy-paste the moves between them.
Although, I've visualized the policy moves when they were shown, and from what I can remember, even the remaining "30%" moves are all pretty good moves (good shape, not "blunders"). And in semeai fighting, the policy on the atari move is almost always close to 95% or even higher, so I doubt it would even have a chance to blunder.
(I'll do another test and run it on another UI and see if I fixed it.)