How Deep is your Go?

I fixed my script (I found the bug: I was somehow adding the move number into the calculated PDF, so once the move number got large enough, the accumulated probability always hit the maximum and the first move was picked).

Here is the new test result, and it still wins easily, by a very large margin I’d say (and I checked: it does follow the PDF and no longer just picks the first move)

And the predicted strength from the OP (it really doesn’t change much)


And from what I can tell, it definitely played like a weak SDK at least. Even the AI Sensei analysis agrees: you can slide up to 9k before the first “blunder” in the game shows up (besides the final 3 moves), and up to 5k before the second blunder in the mid-game shows up. It really is way stronger than its parameter setting.

1 Like

If that “20k” bot wins against any 15k bot, then it just doesn’t work; it’s actually too strong.

I asked my friend to run another game against humanlike-bot-15k, and this time the katahuman-like model was playing black. It seems to play black very poorly. It also doesn’t know when to quit (the evaluated win rate is messed up at this rank setting, so the resign threshold might need to be changed accordingly).

And its ranking is now actually very low (and given all the suicide moves, I am not surprised). Overall, it does play more like a DDK as black (for some moves). It’s very inconsistent. And the AI Sensei bot is pretty consistent in its estimated ranks, but is still “over-ranked”, and from what I can tell it maybe plays a bit better than 15k (and is also pretty inconsistent with its moves).


I’ve tested locally with different rank settings and engine-vs-engine, and it seems that if the opponent is strong, it “reacts” a lot more reasonably: although it plays slowly, it gets “dragged” in the right direction. But it doesn’t know how to use sente when playing black, and just snaps after the opening when running into “uncharted” positions. And if both sides are set to rank_20k (preaz_20k seems to fare better), it starts to play very weird moves, like self-filling suicide/self-atari (not just a small group, but filling half of the board). It’s like tdk and sdk all rolled into one super-unstable model. (I suspect it has to do with not enough training samples at the lowest rank, or inconsistent training labels.)

2 Likes

Update! Common errors should now result in sensible error messages. :slight_smile:

Oh no, what a waste! I count myself lucky that nobody came out with such a project while I was working on it, although @hexahedron did release the human network. :wink:

By the way, if my terse comments here on the forums do not satisfy your curiosity, I am open to just sending out my current close-to-final draft to anyone who asks me via PM.

It contains more illustrations like this one:

[image: network architecture illustration]

Don’t expect this function from me in the next few weeks. But the code for the website is public on my GitHub, so it is just a matter of convincing someone to do it. :wink:

This is an interesting test case and I have no idea what’s going on there.

The training process excluded all handicap games, so anything can happen there. :slight_smile:

Thank you for your thoughts. :blush: The way things are, my strength model depends not just on a KataGo model, but the specific model that it was trained with. For anything else, it would have to be re-trained. Right now, I’m too satisfied about having the hard part behind me to pick up such ideas. But it’s good to note for the future. :slight_smile:

Phew! :cold_sweat:

4 Likes

This is one of the bizarre self-filling games between the rank_20k and preaz_20k models; it’s like neither of them knows it is playing Go.

And funnily enough, the one that gets completely captured has a higher rating.


1 Like

Please consider adding a license! It will allow others to modify your code without worrying about copyright. Licensing a repository - GitHub Docs

5 Likes

If you’re not sure, MIT is the best :wink:

2 Likes

Interesting. First impressions: it’s a lot more complicated than I expected! I guess that’s the nature of master’s theses: if you take a simple approach and solve the problem in less than a week, it doesn’t look like a good thesis :slight_smile:

  • You’re not just taking ratings and results as given labels from OGS, but recomputing the ratings to remove the effects of poor time management, internet disconnections, incorrect scoring, resigning when not actually losing, etc. So that a 5k in your project is not someone who performs like a 5k (including those other factors), but someone who chooses moves like a 5k. Have I understood this correctly?
  • Naively, I would have expected to extract features from KataGo’s neural networks, then feed those features into a simpler model such as some kind of regression, or whatever variety of trees is trendy this year (I know we’ve moved on from random forests, but I’m not entirely up to date). But instead you’ve gone straight to a neural net for strength prediction. Did you try and reject simpler models, or did you decide on nets based on other considerations?
  • I’m struggling to figure out what’s meant by recent moves in this context.

Interesting stuff. I hope you get a good outcome with the Masters!
3 Likes

Also, could there be an option to drop the requirement of player names matching, and just estimate the average strength of all players in all games uploaded? This would make it easier to answer questions such as:

  • How do the average 5k on each of OGS, KGS, Fox compare by this metric?
  • How does the average Shusaku-era pro compare with the average modern pro?

Of course, if you make it too easy, you run the risk of someone uploading 1000 games at a time and overloading the server…

3 Likes

Perhaps this could be used to estimate the strength of a rengo team.

2 Likes

Looks more optimistic than my OGS graph

[two rank graph images]

2 Likes

Thank you for the reminder, license is now specified. :slight_smile:


I spent the past week on my presentation and polishing my thesis, so please forgive the late response.

Labels are still computed using Glicko-2, i.e., from performance. In preparation, I had every game in the dataset judged by KataGo to get a cleaner result, eliminating the factors that you mention. Labels are not ratings at game time, either. They are taken from 10 games into the future. :slight_smile: Let’s say that I play one game every day of January, 2024. Then my label on the game of Jan 10 is my Glicko-2 rating that I had after the game on Jan 20. This makes labels more accurate even from my personal first game on OGS.
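
As a rough sketch of the lookahead idea (not the actual pipeline code; the field names are just for illustration):

```python
from dataclasses import dataclass

@dataclass
class Game:
    date: str            # e.g. "2024-01-10"
    rating_after: float  # player's Glicko-2 rating right after this game

def lookahead_label(games: list[Game], index: int, lookahead: int = 10) -> float | None:
    """Label for games[index]: the player's Glicko-2 rating after `lookahead` more games.

    Assumes `games` is one player's games in chronological order.
    Returns None if fewer than `lookahead` future games exist.
    """
    future = index + lookahead
    if future >= len(games):
        return None
    return games[future].rating_after

# Example: one game per day in January; the Jan 10 game is labelled with the
# rating held after the Jan 20 game.
january = [Game(f"2024-01-{d:02d}", 1400.0 + 5 * d) for d in range(1, 32)]
print(lookahead_label(january, index=9))
```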

First, I wanted to, and second, I had to. My original ambition was to include the KataGo weights in the training, which is only possible with gradient-descent-style training algorithms. These would not work with forests. I had already committed to the idea of using KataGo trunk outputs as my inputs, so I’m dealing with points in a high-dimensional space with no prescribed meaning to the dimensions. Again not a job for a tree :wink:, although a simpler regression model might do it.

That said, I did go straight for the neural net. I justify this with an informal “the problem looked kinda difficult” and wanting to expand my familiarity with exciting technology. :slight_smile:

Edit: The simplest proof-of-concept model that I tried initially used normal KataGo output features like winrate. It had a feed-forward network to evaluate each move with one hidden layer and two outputs: a rating and a weight logit, to be aggregated over all moves by softmax. It worked eventually.
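
Roughly like this, as a toy sketch rather than the actual code (the layer sizes and feature count here are made up):

```python
import torch
import torch.nn as nn

class MoveRatingNet(nn.Module):
    """Per-move feed-forward net: one hidden layer, two outputs (rating, weight logit)."""
    def __init__(self, n_features: int = 8, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, move_features: torch.Tensor) -> torch.Tensor:
        """move_features: (num_moves, n_features) for one player's moves.
        Returns a single rating, aggregated over moves by softmax on the weight logits."""
        out = self.body(move_features)           # (num_moves, 2)
        ratings, weight_logits = out[:, 0], out[:, 1]
        weights = torch.softmax(weight_logits, dim=0)
        return (weights * ratings).sum()         # weighted average rating

# Usage: winrate-style features for 120 moves -> one predicted rating.
model = MoveRatingNet()
print(model(torch.randn(120, 8)).item())
```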

Recent moves just means moves from recent games. Recent games are simply the newest available ones that we have from the player in question, not their old ones from 10 years ago. So, if we want to predict the outcome of my game from Jan 20 in the dataset, we would estimate my strength based on my games from the preceding days in January. Even though the training data might include my games from September too, we have to pretend that we don’t yet know about them at that point.
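
In code terms, the selection is conceptually something like this toy sketch (field names made up):

```python
from datetime import date

def recent_games(all_games: list[dict], target_date: date, window: int = 10) -> list[dict]:
    """Return the player's newest `window` games played strictly before `target_date`.

    `all_games` may also contain later games; we pretend not to know about them yet.
    Each game dict is assumed to have a "date" key holding a datetime.date.
    """
    past = [g for g in all_games if g["date"] < target_date]
    past.sort(key=lambda g: g["date"])
    return past[-window:]
```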

This is not really the use case I had in mind, but you can simulate this by uploading the same game twice, once with the player name assigned to black and once to white.

There are rather tight size limits to what the server will accept. It is already computationally expensive to run hundreds of board positions through KataGo for one rating number. :wink:
If you want to evaluate Fox 5k, you will have to draw samples.

Good, right?

2 Likes

I know this is for the purpose of data cleaning; however, from the post about unreliable human judgment, players even up to SDK level can “legitimately” resign because they think they are behind, and more often DDK players will just resign when one of their groups gets captured, and OGS does account for these “legit” resignations when calculating ranks. Is the cleaning overdone, and what portion of your training data are these kinds of games? (Or did you completely remove them from the training data?)

1 Like

The population of OGS in 2016 and in 2024 may be very different.
Even after recalculation, someone who stopped playing in 2016 after reaching 5k may have an objectively different strength than someone who is still playing now and is also 5k.
So OGS rank is a bad metric for seeing how your strength changes over the years.
A 5k from 2016 and a 5k from 2024 may not be the same. A neural net trying to learn to estimate rank from a game will be confused.
So the date on which a game was played should not be ignored when training such a model.
The input should be the game and the date. The answer might be: “By the standard of 2016 you are …; by the standard of 2024 you are …”

1 Like

It might be possible to use Animiral’s program to measure rank drift on OGS.

5 Likes

First we have to ask ourselves: what is our actual intention with this clean-up? :slight_smile: I am interested in the relationship between certain moves made and playing strength. I am not interested in rating the player as a person with all their misjudgements and emotions. After all, the strength model should apply to anyone. These are some numbers:

Out of 7,049,004 games in the dataset, 62,625 (0.89%) were removed because the end position did not have a clear winner.

Of the remaining games, 553,821 (7.93%) had their result changed by computer evaluation. Of these,

  • 312,129 (56.36%) flipped black-to-white and 241,692 (43.64%) flipped white-to-black;
  • 41,538 (7.50%) flipped from a counted board state;
  • 269,608 (48.68%) flipped from a timeout result; and
  • 242,675 (43.82%) flipped from resignation.

Interesting idea! Is there already evidence for this rank drift?

For training the strength model, games are sampled in random minibatches from the 10k training set, which spans all points in time. Later years are over-represented because OGS grew over time. If the level changes, it would appear to the network as an additional source of noise.
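
As a toy illustration of why that happens under uniform sampling (made-up numbers, not the training pipeline):

```python
import random

# Hypothetical training set: (game, year) pairs, with far more games in recent years.
training_set = [("game", year) for year in (2016, 2020, 2024) for _ in range(10 * (year - 2014))]

def minibatch(dataset, size=32):
    """Uniform random minibatch; years with more games are naturally over-represented."""
    return random.sample(dataset, size)

batch = minibatch(training_set)
print(sum(1 for _, y in batch if y == 2024) / len(batch))  # usually the largest share
```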

1 Like

Beware also of the rating change in January 2021:

P.S. Did ranks of games prior to the rank adjustment get recalculated?

1 Like

Also, Soon™ IIRC

Yes, he always reprocesses all games from all time after a ranking system change. Though some people said there might have been some issues with their early rank graph :man_shrugging:

1 Like

Is there a rank breakdown of these flipped records?

I am still curious about linking moves to particular “results” and using them as ground truth for training. Remember my earlier test that cleaned out the unnecessary exchanges from the failed-ladder games: the predicted ranks dropped from dan to high kyu.

One simple fact is that everyone from 20k to 9p, and AI, can play the 3-3, 3-4, and star points in those early exchanges, so those are effectively not moves that can be used to predict strength. Players from high DDK to low SDK can already memorize josekis, so the simpler josekis would also not be predictive. Players in the high-SDK to low-dan range start to experiment with different openings and strange moves, and their strange “uncategorized” moves would be indistinguishable from the random moves of absolute beginners. And the misjudgments of players in the high-SDK to DDK range (such as resigning games they are not actually losing) push down their strength; it is a simple fact that this will affect the prediction, since you changed those results in the ground truth. And there are indeed sandbaggers on OGS, so they also contribute to deflating the ranks.

It would be interesting to use known sandbaggers’ games to test these effects. (There are lots of them on IGS; maybe I can collect some to run a test later.)

Not yet, but there might be if I get around to it in the near future. :slight_smile:

I am honestly not sure what you mean by that. Maybe you could give an example?

1 Like