How Deep is your Go is a website that can take your SGF games and give you an OGS-scale rating number and rank.
If you send multiple files, make sure that the player name is exactly the same in all of them, so that the program knows whether to evaluate the black or the white moves. If it's just one file, or the name is otherwise ambiguous, you will need to enter the name in the box; again, it should match the player name in the SGF record exactly.
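The matching idea is simple enough to sketch (a simplified illustration, not the actual server code; a regex stands in for a proper SGF parser, which is what you'd really want):

```python
import re

def player_names(sgf_text: str) -> dict:
    """Extract the black (PB) and white (PW) player names from an SGF record.

    Simplified sketch: a real implementation should use a proper SGF
    parser (e.g. sgfmill), since this regex ignores escaped brackets.
    """
    names = {}
    for colour, prop in (("black", "PB"), ("white", "PW")):
        match = re.search(prop + r"\[([^\]]*)\]", sgf_text)
        if match:
            names[colour] = match.group(1)
    return names

def side_to_evaluate(sgf_text: str, player: str) -> str:
    """Return 'black' or 'white' depending on which side `player` played."""
    for colour, name in player_names(sgf_text).items():
        if name == player:  # exact match, as required
            return colour
    raise ValueError(f"{player!r} is neither PB nor PW in this record")
```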
Those who have closely followed my every word on these forums might remember that one time when I mentioned working on my thesis for my Master's degree. That work included training a neural network to do what you see on the website, and building the website itself.
Well, I did that and today is the day where you can try it out for yourself.
Shoutout with thanks to my friend Zerix, who provided a server with a suitable GPU to host this, as well as advice and guidance for all the required web and domain setup steps.
Once I actually graduate, sure.
I still need to polish the thesis paper, address the final feedback, and hand it in. There is an upcoming seminar presentation, and then the defense (exam). With all the formalities, it will take until December.
Error messages are admittedly not all that fleshed-out. It’s more like a haphazardly welded steampunk pile of junk.
The error message appears because you have either not entered a player name in the name field, or not entered one of the two player names exactly as it appears in the SGF file (either .stone.defender. or sandyfriend123).
If you give more than one of your games, against different opponents, you can leave the name blank and the program will find it automatically.
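The automatic detection boils down to a set intersection: the submitter is the one name common to all the files. A minimal sketch of that idea (not the actual server code):

```python
def detect_player(games: list) -> str:
    """Find the single name that appears in every game.

    `games` is a list of {'black': name, 'white': name} dicts, one per
    SGF file. With one game, or several games against the same opponent,
    the result is ambiguous -- hence the name box.
    """
    common = set(games[0].values())
    for game in games[1:]:
        common &= set(game.values())
    if len(common) != 1:
        raise ValueError("player is ambiguous; please enter a name")
    return common.pop()
```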
I might fix that uninformative output in the near future, but not today.
I remember that the OGS ranking system has been adjusted a few times in the past. So, just out of curiosity, did you cut off or adjust the older records in your training data?
The complete dataset is filtered into a pool of games by several quality criteria, such as 19x19 only and no handicap. All the games in the pool are then subjected to the same Glicko-2 implementation that OGS uses (goratings). Just as OGS re-calculated all the ratings, I'm using ratings calculated under this one consistent system as training data.
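In code terms, the preprocessing has roughly this shape (a sketch only; the field names and the `rating_system` interface are made up for illustration, and the real quality criteria are more extensive):

```python
def build_pool(raw_games):
    """Filter the raw game dump down to the training pool."""
    for game in raw_games:
        if game["board_size"] != 19:      # 19x19 only
            continue
        if game["handicap"] > 0:          # no handicap games
            continue
        # ... further quality criteria go here
        yield game

def recompute_ratings(pool, rating_system):
    """Replay the pooled games chronologically through one Glicko-2
    implementation (OGS uses goratings), so that every training label
    comes from a single consistent rating system."""
    for game in sorted(pool, key=lambda g: g["ended"]):
        rating_system.update(game)        # hypothetical interface
    return rating_system.ratings          # hypothetical attribute
```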
By the way, this service is offered with zero warranty!
IIRC the OGS goratings still need to be calibrated with anchors, and there was a massive survey back in the day. I think they were calibrated to the EGF or AGA rating (or some weighted average, honestly not sure).
Based on just the small samples from people testing here, I wonder what the MSE or RMSE for your model is. (It seems to be pretty large, at least 3 or 4 ranks.)
There is ground truth with known OGS rankings, so it shouldn't be hard to test.
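For concreteness, the metric itself is trivial once predictions are paired with known ranks on the same numeric scale (a sketch, not a claim about the actual numbers):

```python
import math

def rmse(predicted_ranks, true_ranks):
    """Root-mean-square error between predicted and true ranks, both on
    the same numeric scale (e.g. OGS rating converted to rank)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_ranks, true_ranks)]
    return math.sqrt(sum(squared) / len(squared))

# rmse([12.0, 15.5, 20.1], [10.0, 17.0, 21.0]) ≈ 1.53
```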
Test data doesn’t necessarily need to be fresh (latest) games, sometimes for output that is time-dependent, it might be better to split test-data out of the training dataset with a relavent time-frame.
Think of it this way: people are constantly learning, and their strength can go up and down. We have no idea whether their strength has reached a relatively steady state, and even if it has stabilized, they can still jump up and reach another steady state. If you always just split off the tail of the dataset as test data, the model might end up fitting a projection of future strength rather than a prediction of current strength.
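To make the difference between the two splits concrete (a sketch; the `ended` timestamp field is an assumption):

```python
import random

def tail_split(games, test_fraction=0.1):
    """Naive split: the newest games become the test set. This risks
    evaluating the model as a projector of future strength rather than
    an estimator of current strength."""
    games = sorted(games, key=lambda g: g["ended"])
    cut = int(len(games) * (1 - test_fraction))
    return games[:cut], games[cut:]

def in_window_split(games, test_fraction=0.1, seed=0):
    """Alternative: draw test games at random from the same time frame
    as the training data, so both splits see the same distribution of
    (possibly drifting) player strengths."""
    games = list(games)
    random.Random(seed).shuffle(games)
    cut = int(len(games) * (1 - test_fraction))
    return games[:cut], games[cut:]
```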