I’ve recently been working on a system to predict the strength of a player using deep learning. At this stage, I have some results I want to publish academically, and I thought it would be good to collaborate on the work so that it can be used in practice (and I’d also like to mention the potential deployment in the paper). I have been communicating a bit with thouis, who told me about OGS’s problem with newcomers who sometimes choose an incorrect rank to begin with. As far as I know, slow convergence of ranking systems is a problem that go servers face in general (with the obvious consequences of bad opponent matching and thus frustrated users), and I think this can be improved a lot, since it is rather wasteful to take just one bit of information (win/loss) from an entire game.
My system has been trained on OGS games and has the following properties:
predicts whether a user is a weak, intermediate, or strong player
needs as little information as one position (plus the 4 previous moves) to work (~71% accuracy); this means that rank can be detected almost in real time
multiple predictions can be aggregated (one game => 78% accuracy); see the sketch after this list
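To illustrate the aggregation, here is a minimal sketch assuming the model emits one softmax vector per evaluated position (the names and the class ordering are just for illustration, not my actual code):

    import numpy as np

    # probs[i] = softmax over (weak, intermediate, strong) for position i of one game
    def aggregate_game(probs):
        'Average the per-position class probabilities and return the most likely class.'
        mean_probs = np.mean(probs, axis=0)
        return int(np.argmax(mean_probs)), mean_probs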
I am currently optimizing the model to gain further accuracy. The 3 classes can be extended to 5 (or more, or even to full rank prediction), but since the network is limited by the information it gets (one position + 4 moves), the accuracy goes down as the number of target classes grows. So my current approach has the advantage of needing little information (thus allowing almost online prediction) at the cost of limited granularity. I have plans to overcome this in the near future, but this is the state at the moment. I’m publishing the source code, model, and dataset as soon as possible.
So, I have a few questions for the OGS devs:
are you interested in collaboration? As I’d like to see my ideas put to use, I am willing to contribute code/analysis.
would you agree to me writing about the collaboration in the paper? This is currently quite urgent for me, as I need to publish it soon.
could you share how the ranking system works (I mean the math behind it)? I would like to investigate ways to incorporate this info into it. Is it Bayesian-based, like TrueSkill, or more of a home-brewed Elo variant (as some threads insinuate)?
maybe most importantly, what do you guys think? Any comments?
The ranking system we use is based on the EGF ranking system, with a tweak that injects some points into the system to help address the problem of rank deflation in the lower ranks.
import math
from decimal import Decimal

def compute_rating_update(black_rank, black_rating, white_rank, white_rating,
                          black_lost, white_lost, handicap, kfactorScale):
    'Derived from http://www.europeangodatabase.eu/EGD/EGF_rating_system.php#System'
    epsilon = 0  # epsilon correction term from the EGF formula, disabled here

    class P():
        'Simple container for per-player results'
        pass

    def K(r):
        'EGF "con" factor: how much a single game can move the rating'
        k = kfactorScale * min(122, (122 - 6*(float(r)/100.0) + math.pow(float(r)/100.0, 2)/15))
        return k

    def K_scale(S, rating):
        'Used to reduce the effect of a loss for lower ranked players'
        if S > 0:  # if they won, full effect
            return 1.0
        t = (rating + 125) / Decimal(325.0)
        return 1 / (1 + math.exp(-t))  # otherwise scale it back some

    B = P()
    B.rating = black_rating
    B.ranking = black_rank
    if handicap > 0:
        # treat a handicap game as if black were (handicap - 0.5) ranks stronger
        B.ranking += handicap
        B.rating += Decimal(100*(handicap-0.5))

    W = P()
    W.rating = white_rating
    W.ranking = white_rank

    # actual score S: 1 for a win, 0 for a loss, 0.5 otherwise (draw / no result)
    if (black_lost and white_lost) or (not black_lost and not white_lost):
        B.S = 0.5
    elif black_lost:
        B.S = 0.0
    else:
        B.S = 1.0
    W.S = 1.0 - B.S

    # expected score of the lower-rated player: logistic in the rating gap D
    D = float(abs(B.rating - W.rating))
    D = min(900.0, D)
    a = float(205 - (min(B.rating, W.rating)/20))  # scale shrinks as ratings grow
    if abs(a) <= 0.5:
        a = 1  # avoid dividing by (nearly) zero
    S_expected = (1.0 / (math.exp(min(100, (D / a))) + 1)) - (epsilon / 2)
    if B.rating < W.rating:
        B.Se = S_expected
        W.Se = 1.0 - S_expected
    else:
        W.Se = S_expected
        B.Se = 1.0 - S_expected

    # new rating = old rating + K * (actual - expected), clamped at the floor
    bnew = black_rating + Decimal(K_scale(B.S, black_rating) * K(black_rating) * (B.S - B.Se))
    wnew = white_rating + Decimal(K_scale(W.S, white_rating) * K(white_rating) * (W.S - W.Se))
    B.rating = max(-900, bnew)
    W.rating = max(-900, wnew)
    B.ranking = int((B.rating + 900) / Decimal(100.0))
    W.ranking = int((W.rating + 900) / Decimal(100.0))
    return {'black': B, 'white': W}
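For example, calling it for a hypothetical even game where black (rating 1150) beats white (rating 1200) looks like this; the ratings and ranks here are made-up values:

    from decimal import Decimal

    result = compute_rating_update(
        black_rank=20, black_rating=Decimal(1150),
        white_rank=21, white_rating=Decimal(1200),
        black_lost=False, white_lost=True,
        handicap=0, kfactorScale=1.0)
    # black gains roughly K * (1 - S_expected) points; white's loss is scaled back by K_scale
    print(result['black'].rating, result['white'].rating)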
We’d certainly be interested in your project! It sounds like it has a lot of promise and could be quite useful.
As for collaboration, that depends on what you need from us.
I think this could be quite helpful in a couple of areas: adjusting provisional ranks and detecting sandbaggers. It probably wouldn’t be that difficult to pass provisional games through the network and use it to verify initial rank assignments, along the lines of the sketch below.
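For instance, a post-game check could look something like this; the helpers `predict_strength_class` and `rank_to_class` are hypothetical names for illustration, not existing OGS or model code:

    def verify_provisional_rank(game, player, predict_strength_class, rank_to_class):
        'Compare the predicted 3-class strength against the class implied by the provisional rank.'
        predicted = predict_strength_class(game, player)  # 0 = weak, 1 = intermediate, 2 = strong
        claimed = rank_to_class(player.provisional_rank)  # map the claimed rank onto the same classes
        if abs(predicted - claimed) >= 2:  # e.g. registered as weak but plays like a strong player
            return 'mismatch'  # candidate for rank adjustment or sandbagger review
        return 'ok'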
@bronislav: what framework are you using for your system? And do you do your own SGF->feature processing, or use something “standard”?
Thanks for the code. EGF’s GoR is basically just rescaled Elo, so it seems reasonable. IMO, the nice thing about modern Bayesian approaches is that the confidence of the estimated rank is taken into account, but that is a debate for another time.
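To make “rescaled Elo” concrete: both systems give the lower-rated player an expected score of the same logistic form,

$$S_E^{\mathrm{Elo}} = \frac{1}{1 + 10^{D/400}}, \qquad S_E^{\mathrm{EGF}} = \frac{1}{1 + e^{D/a}},$$

where $D$ is the rating gap; the main difference is that Elo’s scale of 400 is constant, while the EGF scale $a$ shrinks as ratings grow (the `a = 205 - min(rating)/20` line in the code above).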
Regarding the collaboration: since I am already getting the games from your open API, I do not need anything else at the moment beyond your agreement to experiment with the system on OGS in the future, so that I can truthfully mention something like “a system based on X will be experimentally deployed on OGS in the near future” in the paper. This is basically just a formality; for my paper, it is nice to have a real application coming.
Most importantly, I would really like to see it working, so I am offering my time to help with the analysis/development/deployment, so that it does not end with just a proclamation.
The code is written in Python, and I use a fork of my GitHub project deep-go-wrap for feature extraction and dataset generation. For the deep learning part, I found the Keras framework to work best for me.
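For a rough idea, a 3-class model of this kind might look like the following minimal Keras sketch; the number of input planes (8 here) and the layer sizes are simplifications for illustration, not the actual architecture:

    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    # Input: a 19x19 board with feature planes encoding the current
    # position and the 4 previous moves; output: weak/intermediate/strong.
    model = Sequential([
        Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(19, 19, 8)),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        Flatten(),
        Dense(3, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])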
When you are ready, what we’d probably like to do is deploy this to our beta environment to see how well it works. We’d likely integrate it into post-game processing so that it runs in our task queue, where we perform other analysis and statistics.
What does the run time and resource usage look like?
The code basically has two main parts: feature extraction, which runs on the CPU, and the forward pass of the network, which is better run on the GPU. On my machine, which has an i7 @ 3.20 GHz and a fairly old GTX 580, it takes under 1 second per game when running on 6 cores. But the code is fairly inefficient; e.g., the positions are sent to the GPU one by one rather than in batches, which would speed up the GPU processing a lot (see the sketch below). I have not profiled the code much, but I think the bottleneck is the feature extraction, which has a lot of code written in pure Python and could be optimized considerably.
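Batching is mostly a matter of stacking the per-position planes before calling the network; a sketch, assuming a Keras-style `model.predict` and numpy feature arrays:

    import numpy as np

    def predict_game(model, feature_planes):
        'feature_planes: list of per-position arrays of shape (19, 19, n_planes).'
        batch = np.stack(feature_planes)  # shape (n_positions, 19, 19, n_planes)
        return model.predict(batch, batch_size=len(batch))  # one GPU pass for the whole game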
I’ve got a Cython wrapper for Fuego’s libraries that can extract AlphaGo features, so if you’re using a subset (or a similar set) of those, that might be useful.
Running on the CPU shouldn’t be prohibitive. derpbot runs its network on the CPU, and it’s still less than 2 seconds per evaluation (and my guess is that its network is similar, in cost at least, to what @bronislav is using).
I think the positions are middle-game positions. Though I’m wondering whether the 71% accuracy is meaningful, as we don’t know what kind of range it outputs: how quickly does it narrow down to a range of a few ranks, or even a few rating points?
The accuracy listed is of course an average; during the fuseki (the first ~30 moves) the error is higher. Nothing can be predicted from an empty board, almost nothing can be predicted after one move, and so on.
It is not the last 5 moves, but rather the current whole-board configuration plus the last 4 moves.
The network only predicts 3 classes at the moment: strong player, intermediate player, weak player. The accuracy means that in 71% of predictions the class is predicted correctly. In more than 96% of cases, the true class is one of the top 2 predicted classes; this means that when the player is weak and the network makes an error, it says that the player is intermediate rather than strong.
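For concreteness, these are just top-1 and top-2 accuracy; given the softmax outputs and the true labels, both can be computed like this sketch:

    import numpy as np

    def top_k_accuracy(probs, labels, k):
        'probs: (n, 3) softmax outputs; labels: (n,) true class indices.'
        top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k most probable classes
        return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

    # per the numbers quoted above: top_k_accuracy(probs, labels, 1) ~ 0.71
    # and top_k_accuracy(probs, labels, 2) > 0.96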
For the record, when I said joseki, I didn’t necessarily mean fuseki. Many established patterns, like reductions, side invasions, or kikashi, are played during the middle game.
I love the idea of adding more information to the analysis. Wins and losses are very superficial, and on other online Go systems I have used, they form a zero-sum game that eventually causes ranks to change when they should not. The ranking system should be able to recognize when someone doesn’t really know the rules of Go. For early beginners, it should recognize when the player doesn’t know how to make a group of stones live; that, right away, should predict a rank like 24k. Perhaps it is possible to detect “playing at a distance with balance”, although that would probably fail against mirror play. Detecting standard joseki and fuseki might be useful in predicting better kyu levels. Predicting dan levels might be much more challenging, though, and wins/losses might be more meaningful at those levels. I am so happy that OGS may integrate such interesting improvements in rank estimation.