You’re still attempting to use what seems to me to be the wrong goal - predicting the result of evenly-matched games. Yes, you might as well flip a coin for those.
The only way one rating system could be better than another is if their predictions differ - one of them might predict that White has the advantage while the other thinks Black does. Then depending on which way the game goes, we gain a little evidence about which system might be better.
I think that when the rating gap is the same, predictions would be the same, at least if the rating system is some sort of Elo rating system. When the rating gap between opponents gets larger, the win prediction would be skewed further away from 50%.
But Elo based rating systems can still differ in the scale of the rating change of both players when a game result is processed, which affects predictions for their future games.
In a regular Elo rating system, as used in chess, the rating change sizes are scaled by the K factor.
A large K factor results in more rating volatility and a low K factor results in less rating volatility.
Chess rating systems usually use a high K factor (more volatile ratings) for lower rated players and a low K factor (less volatile ratings) for higher rated players. The EGF rating system has a similar scale factor called con, which depends on the current rating of a player (see "EGF ratings system" on the E.G.D., the European Go Database).
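As a rough illustration of the K factor, here is a minimal sketch of the textbook Elo update (logistic curve on a 400-point scale). The function names and the K values of 32 and 16 are just illustrative assumptions, not the parameters of any particular federation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k_a, k_b):
    """Return the new ratings of A and B after one game.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k_a and k_b are the (possibly different) K factors of the two players.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k_a * (score_a - expected_a)
    # B's score and expectation are the complements of A's.
    new_b = rating_b + k_b * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Same game, two different K factors (illustrative values only).
print(elo_update(1500, 1500, 1.0, k_a=32, k_b=32))  # (1516.0, 1484.0)
print(elo_update(1500, 1500, 1.0, k_a=16, k_b=16))  # (1508.0, 1492.0)
```

With the larger K the same result moves both ratings twice as far, which is what "more volatile" means here.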
I think that Glicko-2 has some similar rating volatility mechanism, but I don’t know much about it. From what I understand, it not only depends on the current rating of a player, but also on the recent result history of a player.
What I mean is that if we look at a particular game, rating system A might think Black is 12k and White is 10k, while rating system B thinks Black is 11k and White is 15k.
Then if, say, Black wins, that is some kind of evidence that system B is better than system A.
To compare systems, mismatched games are the ones we should be looking for.
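One hedged way to turn this into a number is to score each system's predicted win probability against what actually happened, for example with log loss (lower is better). This is only a sketch: `score_systems` and the probabilities 0.35 and 0.85 are made up for illustration, standing in for "system A thinks White is favoured, system B thinks Black is favoured".

```python
import math


def log_loss(p_black_wins: float, black_won: bool) -> float:
    """Penalty for one prediction; confident wrong predictions cost the most."""
    p = min(max(p_black_wins, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(p if black_won else 1.0 - p)


def score_systems(games):
    """games: list of (p_black_wins_system_a, p_black_wins_system_b, black_won).

    Returns (mean log loss of A, mean log loss of B, number of games where
    the two systems disagree about who the favourite is).
    """
    total_a = total_b = 0.0
    disagreements = 0
    for p_a, p_b, black_won in games:
        total_a += log_loss(p_a, black_won)
        total_b += log_loss(p_b, black_won)
        if (p_a > 0.5) != (p_b > 0.5):
            disagreements += 1
    n = len(games)
    return total_a / n, total_b / n, disagreements


# Hypothetical game: A gives Black 35%, B gives Black 85%, and Black wins.
mean_a, mean_b, disagreements = score_systems([(0.35, 0.85, True)])
print(mean_a, mean_b, disagreements)
```

A single game says very little, of course; the point is that the games where the two systems disagree are the ones that move these two averages apart.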
An Elo rating system is in itself already a sort of logistic regression system, continuously minimising apparent mismatches between its expectations and actual events.
If a rating system is pretty bad, I think you’ll already find mismatches between players with the same rating, where some of them are evidently much stronger than others (winrates far away from 50%).
A second-order way in which a rating system can be bad (though it may be less obvious) is when almost all equally rated players actually have a 50% winrate against others of the same rating (looking good), but rating gaps are scaled wrong, so that winrates between unevenly matched players are much less or much more skewed than the rating system predicts.
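Both kinds of badness could be checked with a simple calibration table: bucket games by rating gap and compare the observed winrate of the higher rated player with the predicted one. The sketch below assumes an Elo-style prediction on a 400-point scale; a Glicko-based system would plug in its own prediction formula. The function names and sample games are hypothetical.

```python
from collections import defaultdict


def predicted_winrate(gap: float) -> float:
    """Elo-style win probability of the higher rated player (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def calibration_by_gap(games, bucket_width=100):
    """games: iterable of (rating_gap, stronger_won) with rating_gap >= 0.

    Returns {bucket_start: (games, observed winrate, mean predicted winrate)}.
    """
    buckets = defaultdict(lambda: [0, 0, 0.0])  # count, wins, sum of predictions
    for gap, stronger_won in games:
        key = int(gap // bucket_width) * bucket_width
        bucket = buckets[key]
        bucket[0] += 1
        bucket[1] += 1 if stronger_won else 0
        bucket[2] += predicted_winrate(gap)
    return {
        key: (n, wins / n, predicted / n)
        for key, (n, wins, predicted) in sorted(buckets.items())
    }


# Made-up sample: two even games and four games with a 250-point gap.
sample = [(0, True), (0, False), (250, True), (250, True), (250, True), (250, False)]
for bucket, (n, observed, predicted) in calibration_by_gap(sample).items():
    print(bucket, n, round(observed, 3), round(predicted, 3))
```

If the gap-0 bucket sits at 50% but the observed and predicted columns drift apart as the gap grows, that is the "scaled wrong" case.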
I’m pretty sure it does, but in a very complicated way.
I mean, it has to, right?
It looks like you compute the volatility, which is based on the results of the games, using some iterative algorithm. This then gives a pre-period guess at the new deviation, which gets updated again with a quantity v that depends on all the game results.
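For reference, these are the relevant steps as I read them in Glickman's Glicko-2 description, on the internal scale where $\mu = (r - 1500)/173.7178$ and $\phi = \mathrm{RD}/173.7178$. Here $s_j$ is the result against opponent $j$ and $\sigma'$ is the new volatility from the iterative step, which is omitted:

$$
g(\phi) = \frac{1}{\sqrt{1 + 3\phi^2/\pi^2}}, \qquad
E(\mu, \mu_j, \phi_j) = \frac{1}{1 + \exp\bigl(-g(\phi_j)(\mu - \mu_j)\bigr)}
$$

$$
v = \Bigl[\sum_j g(\phi_j)^2 \, E(\mu, \mu_j, \phi_j)\,\bigl(1 - E(\mu, \mu_j, \phi_j)\bigr)\Bigr]^{-1}, \qquad
\Delta = v \sum_j g(\phi_j)\,\bigl(s_j - E(\mu, \mu_j, \phi_j)\bigr)
$$

$$
\phi^* = \sqrt{\phi^2 + \sigma'^2}, \qquad
\phi' = \frac{1}{\sqrt{1/\phi^{*2} + 1/v}}, \qquad
\mu' = \mu + \phi'^2 \sum_j g(\phi_j)\,\bigl(s_j - E(\mu, \mu_j, \phi_j)\bigr)
$$

Written this way, the results $s_j$ enter through $\Delta$ (and therefore $\sigma'$) and through $\mu'$, while $v$ itself is built from the opponents' ratings and deviations.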
@Feijoa If we look at, say, some of my recent games with the API:
| ended | game_id | played_black | handicap | rating | deviation | volatility | opponent_id | opponent_rating | opponent_deviation | outcome | extra | annulled | result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1714928783 | 61712295 | 1 | 0 | 1841.87 | 64.33 | 0.059997 | 1208231 | 1551.76 | 118.29 | 0 | null | 0 | Resignation |
| 1714854525 | 63990994 | 1 | 0 | 1860.62 | 64.01 | 0.059991 | 681030 | 2052.45 | 62.59 | 0 | null | 0 | Timeout |
| 1712514504 | 63187854 | 0 | 0 | 1866.62 | 63.98 | 0.059992 | 614879 | 1960.43 | 63.78 | 0 | null | 0 | Resignation |
| 1712090237 | 63041688 | 1 | 0 | 1875.45 | 64.15 | 0.059994 | 788871 | 1786.14 | 86.31 | 0 | null | 0 | Timeout |
| 1710353740 | 62048349 | 0 | 1 | 1890.08 | 64.27 | 0.059992 | 811318 | 1720.78 | 65.21 | 1 | null | 0 | Resignation |
| 1709891931 | 61712293 | 1 | 0 | 1881.16 | 64.45 | 0.059993 | 868221 | 1739.64 | 61.03 | 1 | null | 0 | Resignation |
| 1709834427 | 61712292 | 0 | 0 | 1873.67 | 64.56 | 0.059994 | 54729 | 1755.47 | 63.12 | 1 | null | 0 | Resignation |
| 1709681115 | 62205925 | 0 | 0 | 1865.44 | 64.73 | 0.059996 | 1348548 | 1366.91 | 62.24 | 1 | null | 0 | 64.5 points |
| 1709679034 | 62205283 | 0 | 0 | 1864.10 | 64.12 | 0.059996 | 1348548 | 1368.17 | 61.57 | 1 | null | 0 | 143.5 points |
| 1709656550 | 61712294 | 0 | 0 | 1862.76 | 63.50 | 0.059997 | 915719 | 1545.64 | 62.61 | 1 | null | 0 | Resignation |
| 1709603311 | 62178683 | 0 | 0 | 1859.45 | 63.16 | 0.059998 | 1524321 | 1372.31 | 95.82 | 1 | null | 0 | 63.5 points |
| 1709601311 | 62178033 | 0 | 0 | 1858.03 | 62.53 | 0.059999 | 1285501 | 1684.37 | 61.53 | 0 | null | 0 | Resignation |
| 1709599180 | 62177342 | 1 | 0 | 1874.50 | 62.42 | 0.059994 | 1497548 | 1864.64 | 61.19 | 1 | null | 0 | Resignation |
| 1709576028 | 62168430 | 1 | 4 | 1863.45 | 62.55 | 0.059994 | 38364 | 2238.99 | 65.10 | 0 | null | 0 | Resignation |
| 1709557391 | 62161289 | 1 | 0 | 1876.32 | 62.65 | 0.059993 | 1400433 | 2032.38 | 63.05 | 0 | null | 0 | Timeout |
| 1709375297 | 62098060 | 0 | 0 | 1882.98 | 62.63 | 0.059995 | 241688 | 1797.27 | 61.23 | 0 | null | 0 | Timeout |
| 1708394769 | 61753656 | 0 | 0 | 1897.11 | 62.70 | 0.059993 | 118604 | 1599.08 | 63.77 | 1 | null | 0 | 9.5 points |
| 1708388194 | 61751546 | 0 | 0 | 1893.57 | 62.37 | 0.059994 | 1472986 | 1960.82 | 63.03 | 1 | null | 0 | Timeout |
| 1708386532 | 61751021 | 0 | 0 | 1880.14 | 62.44 | 0.059993 | 1178493 | 1808.33 | 61.66 | 0 | null | 0 | 5.5 points |
| 1708020212 | 61619403 | 0 | 0 | 1893.76 | 62.52 | 0.059991 | 937108 | 1711.37 | 63.08 | 1 | null | 0 | Resignation |
| 1707683923 | 61500668 | 1 | 0 | 1887.81 | 62.44 | 0.059993 | 1233684 | 1654.56 | 74.29 | 1 | null | 0 | 55.5 points |
| 1706890633 | 59621153 | 0 | 0 | 1883.06 | 62.23 | 0.059994 | 778702 | 1883.06 | 65.71 | 0 | null | 0 | Resignation |
| 1706873695 | 59621169 | 0 | 0 | 1894.32 | 62.34 | 0.059994 | 782566 | 1722.35 | 64.65 | 1 | null | 0 | Resignation |
| 1706303329 | 61023149 | 0 | 0 | 1888.14 | 62.27 | 0.059996 | 742404 | 2077.83 | 67.40 | 0 | null | 0 | Resignation |
| 1705809041 | 59621162 | 1 | 0 | 1893.86 | 62.16 | 0.059997 | 499807 | 1917.47 | 65.89 | 1 | null | 0 | Resignation |
| 1705530480 | 60758055 | 1 | 0 | 1881.88 | 62.26 | 0.059997 | 703157 | 1917.91 | 66.92 | 1 | null | 0 | 0.5 points |
| 1704579880 | 60433086 | 0 | 0 | 1869.49 | 62.35 | 0.059996 | 1041259 | 1804.74 | 62.78 | 1 | null | 0 | 1.5 points |
| 1704475861 | 59621158 | 0 | 0 | 1860.22 | 62.45 | 0.059997 | 1039 | 1934.75 | 66.68 | 0 | null | 0 | Resignation |
| 1703411584 | 59621147 | 1 | 0 | 1869.20 | 62.55 | 0.059998 | 130250 | 1678.06 | 62.75 | 0 | null | 0 | Timeout |
| 1701371020 | 59363707 | 0 | 0 | 1886.07 | 62.40 | 0.059993 | 868221 | 1770.27 | 60.65 | 1 | null | 0 | Resignation |
| 1701111807 | 59276987 | 0 | 0 | 1878.33 | 62.44 | 0.059994 | 1312374 | 1903.62 | 61.93 | 1 | null | 0 | Resignation |
| 1700863379 | 59194012 | 1 | 0 | 1866.16 | 62.56 | 0.059993 | 831416 | 1837.59 | 61.65 | 1 | null | 0 | Resignation |
| 1697742346 | 58121505 | 1 | 0 | 1855.67 | 62.69 | 0.059994 | 1299675 | 2068.89 | 84.74 | 1 | null | 0 | Resignation |
There are some games, about 15 or so rows down the table, where I lose to stronger players and the deviation goes down slightly (62.5, 62.4-ish). Then I lose to a lower rated player and it starts going up (63, 64), and the other wins aren’t actually bringing it back down until maybe the third entry. But then I lose again a few times and it goes back up, particularly when I resigned today against a 5 kyu.
Amybot ddk, for example, is very different: the volatility is pretty much maxed out, I think. I don’t think it can go above 0.15, for instance.
| ended | game_id | played_black | handicap | rating | deviation | volatility | opponent_id | opponent_rating | opponent_deviation | outcome | extra | annulled | result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1714931495 | 64018627 | 1 | 0 | 904.47 | 100.98 | 0.149693 | 869277 | 1069.30 | 62.40 | 1 | null | 0 | 9.5 points |
| 1714931246 | 64018433 | 1 | 0 | 860.44 | 100.76 | 0.149622 | 424460 | 1362.87 | 76.82 | 0 | null | 0 | Resignation |
| 1714931076 | 64018554 | 0 | 0 | 863.72 | 98.27 | 0.149632 | 1419172 | 1002.92 | 62.03 | 0 | null | 0 | Resignation |
| 1714930909 | 64018527 | 0 | 0 | 882.02 | 98.45 | 0.149654 | 1419172 | 994.99 | 62.00 | 0 | null | 0 | Resignation |
| 1714930851 | 64018488 | 0 | 0 | 902.40 | 98.85 | 0.149674 | 1419172 | 1009.55 | 61.89 | 1 | null | 0 | 8.5 points |
| 1714930635 | 64018387 | 0 | 0 | 864.06 | 98.93 | 0.149629 | 1419172 | 1002.37 | 61.82 | 0 | null | 0 | Resignation |
| 1714930545 | 64018047 | 1 | 0 | 882.70 | 99.23 | 0.149651 | 1501209 | 632.24 | 80.95 | 0 | null | 0 | 32.5 points |
| 1714930487 | 64018342 | 0 | 0 | 928.78 | 97.95 | 0.149549 | 1419172 | 993.02 | 61.83 | 0 | null | 0 | Resignation |
| 1714930365 | 64018293 | 0 | 0 | 952.81 | 98.51 | 0.149561 | 1419172 | 982.60 | 61.84 | 0 | null | 0 | Resignation |
| 1714930234 | 64018247 | 0 | 0 | 980.01 | 99.21 | 0.149566 | 1419172 | 970.97 | 61.85 | 0 | null | 0 | Resignation |
| 1714930146 | 64017934 | 1 | 0 | 1010.94 | 99.97 | 0.149558 | 1505560 | 889.87 | 62.08 | 1 | null | 0 | 16.5 points |
| 1714930066 | 64018156 | 0 | 0 | 990.53 | 100.56 | 0.149579 | 1419172 | 958.62 | 61.84 | 0 | null | 0 | Resignation |
| 1714929957 | 64017988 | 0 | 5 | 1024.33 | 101.45 | 0.149564 | 1472623 | 747.19 | 61.59 | 1 | null | 0 | 25.5 points |
| 1714929910 | 64018104 | 0 | 0 | 1001.24 | 102.45 | 0.149581 | 1419172 | 968.99 | 61.85 | 1 | null | 0 | 0.5 points |
| 1714929632 | 64017650 | 1 | 0 | 971.83 | 103.83 | 0.149586 | 1329510 | 1150.89 | 62.39 | 0 | null | 0 | Resignation |
| 1714929518 | 64017948 | 0 | 0 | 989.27 | 104.50 | 0.149610 | 545062 | 903.54 | 63.19 | 1 | null | 0 | Resignation |
| 1714929518 | 64017952 | 0 | 9 | 963.63 | 106.08 | 0.149625 | 1553152 | 321.80 | 74.21 | 1 | null | 0 | Resignation |
| 1714929496 | 64017824 | 0 | 4 | 959.72 | 103.99 | 0.149636 | 1472623 | 757.58 | 61.58 | 1 | null | 0 | 35.5 points |
| 1714929444 | 64017925 | 0 | 0 | 931.88 | 105.59 | 0.149647 | 1554022 | 511.66 | 88.47 | 1 | null | 0 | 50.5 points |
| 1714929293 | 64017742 | 0 | 9 | 926.24 | 103.92 | 0.149661 | 1553152 | 330.92 | 74.82 | 1 | null | 0 | 77.5 points |
| 1714929281 | 64017861 | 0 | 0 | 921.33 | 102.00 | 0.149674 | 1554022 | 515.97 | 88.82 | 1 | null | 0 | 42.5 points |
| 1714929066 | 64017570 | 0 | 5 | 915.65 | 100.16 | 0.149689 | 1472623 | 748.06 | 61.58 | 0 | null | 0 | Resignation |
| 1714928965 | 64017723 | 0 | 0 | 945.34 | 101.12 | 0.149688 | 1554022 | 519.89 | 89.09 | 1 | null | 0 | Resignation |
| 1714928767 | 64017634 | 0 | 0 | 940.31 | 99.08 | 0.149702 | 1554022 | 524.01 | 89.41 | 1 | null | 0 | Resignation |
| 1714928449 | 64017370 | 0 | 4 | 935.26 | 96.93 | 0.149717 | 1472623 | 759.10 | 61.59 | 1 | null | 0 | 55.5 points |
| 1714928390 | 64017339 | 0 | 7 | 909.24 | 97.37 | 0.149722 | 1498231 | 691.26 | 77.09 | 0 | null | 0 | 20.5 points |
| 1714927934 | 64017108 | 0 | 6 | 935.78 | 97.79 | 0.149725 | 1472623 | 750.75 | 61.57 | 0 | null | 0 | Resignation |
| 1714927822 | 64017054 | 0 | 0 | 961.96 | 98.36 | 0.149731 | 1163766 | 653.56 | 66.48 | 1 | null | 0 | 64.5 points |
| 1714927571 | 64017102 | 1 | 2 | 953.46 | 96.98 | 0.149752 | 451385 | 1173.36 | 63.78 | 0 | null | 0 | Resignation |
| 1714927375 | 64016755 | 1 | 0 | 973.09 | 97.13 | 0.149772 | 1505560 | 875.53 | 62.02 | 0 | null | 0 | 24.5 points |
| 1714927331 | 64016900 | 0 | 0 | 1009.35 | 97.05 | 0.149731 | 1501171 | 835.32 | 62.03 | 1 | null | 0 | 17.5 points |
| 1714927310 | 64016993 | 1 | 2 | 993.93 | 96.76 | 0.149755 | 1080566 | 1017.30 | 62.89 | 1 | null | 0 | 8.5 points |
| 1714927045 | 64016883 | 1 | 2 | 970.87 | 97.10 | 0.149768 | 1080566 | 1028.11 | 62.99 | 1 | null | 0 | 56.5 points |
| 1714926966 | 64016928 | 0 | 4 | 944.99 | 97.55 | 0.149774 | 546626 | 887.28 | 61.88 | 1 | null | 0 | 14.5 points |
| 1714926943 | 64016810 | 1 | 2 | 907.55 | 97.42 | 0.149728 | 451385 | 1166.78 | 63.74 | 0 | null | 0 | Resignation |
| 1714926784 | 64016722 | 0 | 0 | 924.46 | 97.35 | 0.149751 | 1501171 | 844.23 | 62.06 | 1 | null | 0 | 46.5 points |
| 1714926747 | 64016776 | 1 | 2 | 902.03 | 97.74 | 0.149766 | 1080566 | 1017.34 | 63.09 | 0 | null | 0 | Resignation |
| 1714926669 | 64016793 | 0 | 0 | 929.27 | 98.30 | 0.149769 | 418160 | 919.08 | 70.87 | 1 | null | 0 | 20.5 points |
| 1714926591 | 64016794 | 1 | 0 | 900.74 | 98.90 | 0.149768 | 43708 | 1383.35 | 75.38 | 0 | null | 0 | 8.5 points |
| 1714926421 | 64016636 | 0 | 0 | 904.25 | 96.37 | 0.149779 | 1118872 | 905.42 | 62.22 | 0 | null | 0 | Resignation |
| 1714926382 | 64016468 | 1 | 0 | 932.47 | 96.69 | 0.149775 | 1505560 | 861.99 | 62.00 | 0 | null | 0 | 48.5 points |
| 1714926268 | 64016486 | 0 | 0 | 966.44 | 96.74 | 0.149746 | 1501171 | 852.12 | 62.05 | 1 | null | 0 | 10.5 points |
| 1714926223 | 64016646 | 0 | 4 | 946.93 | 96.86 | 0.149766 | 546626 | 900.74 | 62.04 | 0 | null | 0 | Resignation |
| 1714926194 | 64016516 | 1 | 2 | 965.80 | 96.93 | 0.149787 | 451385 | 1182.84 | 63.69 | 1 | null | 0 | 23.5 points |
| 1714926071 | 64016425 | 0 | 0 | 928.87 | 96.71 | 0.149741 | 1118872 | 916.51 | 62.28 | 1 | null | 0 | 25.5 points |
| 1714926060 | 64016601 | 0 | 4 | 901.33 | 97.10 | 0.149740 | 546626 | 894.28 | 61.94 | 0 | null | 0 | Resignation |
| 1714925704 | 64002724 | 0 | 0 | 917.61 | 96.92 | 0.149764 | 1272533 | 1071.26 | 61.76 | 0 | null | 0 | 33.5 points |
| 1714925595 | 64016185 | 0 | 0 | 934.36 | 96.79 | 0.149787 | 1501171 | 837.83 | 61.99 | 0 | null | 0 | 1.5 points |
| 1714925524 | 64016190 | 1 | 2 | 970.28 | 96.66 | 0.149747 | 451385 | 1174.51 | 63.76 | 0 | null | 0 | 10.5 points |
| 1714925142 | 64016206 | 0 | 0 | 990.96 | 96.85 | 0.149765 | 1419172 | 956.59 | 61.84 | 0 | null | 0 | Resignation |
| 1714925065 | 64016157 | 0 | 0 | 1022.32 | 97.13 | 0.149749 | 1419172 | 966.18 | 61.85 | 1 | null | 0 | 15.5 points |
Probably it makes sense, since the bot has crazy results all the time, beating 13 kyus and then losing to 25 kyus. It’s almost a definition of volatile.
If I get back to a computer I should try to make some graphs or something to plot volatility, deviation and outcome against each other somehow, and see if the behaviour is consistent or intuitive in some way.
But it does look like, say, Amy’s deviation changes a lot more because of the high volatility.
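Short of proper graphs, a quick first check could be to tabulate how the deviation moved after each game, split by outcome. The sketch below uses the six most recent rows of my own table (the first one above), keeping only `ended`, `rating`, `deviation` and `outcome`. It assumes that the rating and deviation in each row are the values after that game, which is what the data appears to show; the function names are made up.

```python
# (ended, rating, deviation, outcome), newest first, copied from the first table.
rows = [
    (1714928783, 1841.87, 64.33, 0),
    (1714854525, 1860.62, 64.01, 0),
    (1712514504, 1866.62, 63.98, 0),
    (1712090237, 1875.45, 64.15, 0),
    (1710353740, 1890.08, 64.27, 1),
    (1709891931, 1881.16, 64.45, 1),
]


def deviation_changes(rows):
    """Return [(outcome, change in deviation)] for each game after the oldest one."""
    ordered = sorted(rows, key=lambda row: row[0])  # oldest first
    changes = []
    for previous, current in zip(ordered, ordered[1:]):
        changes.append((current[3], round(current[2] - previous[2], 2)))
    return changes


def mean_change(changes, outcome):
    """Average deviation change over games with the given outcome (1 = win, 0 = loss)."""
    selected = [delta for result, delta in changes if result == outcome]
    return sum(selected) / len(selected) if selected else None


changes = deviation_changes(rows)
print(changes)
print("after wins:", mean_change(changes, 1), "after losses:", mean_change(changes, 0))
```

Six games is far too few to conclude anything, and the time between games may matter too, but the same function could be run over the full tables for both accounts.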
The updated deviation is independent of results. I agree that it doesn’t sound very reasonable, but if I remember correctly it has something to do with how we use a period consisting of only a single game.
It does seem very odd. If it turns out the calculator is wrong, it’s probably my fault from trying to get the Python code turned into JavaScript. Definitely some things are off, like the win rate, which I don’t think is shown anywhere.
The deviation never goes very low because the period never includes more than one game. Using a time-based period would cause a player who plays more frequently to have lower deviation (all other things being equal).
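To make both points concrete, here is a minimal sketch of the Glicko-2 deviation update as I understand it from Glickman's description, with the new volatility simply taken as given (the iterative volatility step is skipped). It is not the OGS implementation; the function names and the sample numbers are illustrative assumptions.

```python
import math

SCALE = 173.7178  # Glicko-2 factor between rating points and the internal scale


def g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def new_deviation(rating, deviation, volatility, opponents):
    """Deviation after one rating period.

    opponents: list of (opponent_rating, opponent_deviation) played in the period.
    volatility: assumed to be the already-updated volatility (iteration skipped).
    """
    mu = (rating - 1500.0) / SCALE
    phi = deviation / SCALE
    inverse_v = 0.0
    for opponent_rating, opponent_deviation in opponents:
        mu_j = (opponent_rating - 1500.0) / SCALE
        phi_j = opponent_deviation / SCALE
        e = expected(mu, mu_j, phi_j)
        inverse_v += g(phi_j) ** 2 * e * (1.0 - e)
    phi_star = math.sqrt(phi ** 2 + volatility ** 2)  # pre-period deviation
    return SCALE / math.sqrt(1.0 / phi_star ** 2 + inverse_v)


# Illustrative numbers in the same range as the tables above.
one_game = new_deviation(1850, 64, 0.06, [(1850, 62)])
five_games = new_deviation(1850, 64, 0.06, [(1850, 62)] * 5)
print(one_game, five_games)
```

Note that the function takes no game outcomes at all: once the volatility is fixed, only the opponents matter. A period with five games should also come out with a noticeably lower deviation than a period with one.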
The Glicko 2 formulas are beyond my math abilities, but I have a fuzzy idea of the notation so this may be naive.
Is it possible to use some hybrid Glicko where the rating update is based on single game periods, but the rating deviation and volatility updates are based on periods?
The idea would be to preserve the expected behavior that a win means the rating goes up (and a loss means it goes down), but also to make the rating deviations and volatility somewhat more useful for the median active user. I.e., set the period to a length where the median active user plays at least 5 games.
I rushed this together fairly quickly and didn’t edit extensively, and @anoek graciously let me land it pretty much as-is even though it probably has lots of confusing bits and (surely!) some outright errors.
I didn’t capture everyone’s ideas. With limited time, I focused on the ones that I’ve been thinking the most about myself, especially where I’d thought through some plausibly useful math that I was afraid I’d forget.
If your idea isn’t in there, it hasn’t been rejected. I just didn’t write it down (yet).
BTW, pull requests are welcome! (Although see urgent stuff at the bottom for the collaboration I think is most useful right now…)
There’s a bunch of MathJax / LaTeX / whatever… those bits are much easier to read in the browser than in the file when you’re editing. (I think it’s the right tradeoff? Math is hard in text files. Another option would be pseudocode or Python code or something, but that’s harder to compare to the Glicko-2 paper… that’s why I stuck to math…)
The high-level summary is:
- Improve the metrics
  - Review the current metrics, which are mostly targeting predictive performance, and remove the ones that we don’t find useful to reduce noise.
  - Expand the metrics for things we want to track but aren’t currently tracking (e.g., performance when rank diff is bigger than 1; historic volatility of ratings; etc.)
  - Show more cross-sectional views of the metrics (by rank? by board size? by game speed? human or bot?)
- Develop a script that models the specific ratings categories (e.g., blitz-9x9) as the primary drivers, and tune how the ratings categories interact to improve their performance
  - I expect that with some effort we can make this much better than the status quo (where “overall” is the primary driver)
- Model fixed-length (but start time variable per-user) time periods
  - Ideally, make ratings relatively stable for players who play relatively frequently and whose playing strength is relatively consistent, while still allowing quick changes when a player’s strength changes
I’m out (no coding, just whatever I can do easily from mobile) for a few months. The document currently answers the question, “What would @dexonsmith do in the goratings repo if he had some dedicated time for experiments and no input from others?”
… but I would be delighted if others dove in and moved forward before I’m “back”, even if that means the direction evolves significantly from what I’ve envisioned. Please keep the document updated on the state of experiments, fix it where appropriate, and reorganize (or even delete) sections that are not useful (we have Git history to recover lost gold). If you include me on pull requests I’ll do my best to give feedback, and include @anoek for his comments/feedback as well.
Urgent stuff: (if you’ll allow me to call something urgent that I won’t be making time for myself until September-ish…):
- Improve the metrics so that we’re measuring what we care about (and/or add or improve visualizations so it’s easier to understand the results).
- Model specific-ratings-categories-as-primary in such a way that we use the metrics to evaluate it (in the document, I call this “cohesive-ratings-grid” or something).
If you have time to contribute, that’s what I’d most appreciate help with! (Once those things are done, it’ll be more interesting to consider ideas for how to make ratings better, since we’ll be able to evaluate them.)
I’m not sure that your specific idea can work well—IMO we should update them together—but it’s certainly possible to preserve the win-means-rating-goes-up property when introducing time periods. The only thing that breaks that property is rolling periods, where you have a look-back window (of games, or of time) that includes different but overlapping subsets of games in each evaluation.
Note that my current idea for what to experiment with, documented in RatingsV6 linked above, doesn’t include rolling periods, so the win-means-rating-goes-up property would be preserved.
Also, note that @anoek is against breaking the win-means-rating-goes-up property (given past experience) unless we have really compelling data that it’s giving us some important win we can’t achieve another way. I’m currently optimistic we won’t need to break it.