A couple of questions about handicap and rating

You’re still attempting to use what seems to me to be the wrong goal - predicting the result of evenly-matched games. Yes, you might as well flip a coin for those.

The only way one rating system could be better than another is if their predictions differ - one of them might predict that White has the advantage while the other thinks Black does. Then depending on which way the game goes, we gain a little evidence about which system might be better.

I think that when the rating gap is the same, the predictions would be the same, at least if the rating system is some sort of Elo rating system. As the rating gap between opponents gets larger, the win prediction is skewed further away from 50%.
But Elo-based rating systems can still differ in how much both players’ ratings change when a game result is processed, which affects the predictions for their future games.
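
To make the "same gap, same prediction" point concrete, here is a minimal sketch of the standard logistic Elo expectation. The 400-point scale is the usual chess convention and is an assumption here; a Go rating system may use a different scale.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that player A beats player B under the logistic Elo model.

    `scale` is an assumed parameter: 400 is the chess convention.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Only the gap matters: 1500 vs 1400 gives the same prediction as 2100 vs 2000.
for gap in (0, 100, 200, 400):
    print(gap, round(elo_expected_score(1500 + gap, 1500), 3))
```

With a gap of 0 the prediction is exactly 50%, and it moves towards 100% as the gap grows.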

In a regular Elo rating system, as used in chess, the rating change sizes are scaled by the K factor.
A large K factor results in more rating volatility and a low K factor results in less rating volatility.
Chess rating systems usually use a high K factor (more volatile rating) for lower rated players and a low K factor (less volatile ratings) for higher rated players. The EGF rating system has a similar scale factor called con, which depends on the current rating of a player. (EGF ratings system | E.G.D. - European Go Database)
I think that Glicko-2 has some similar rating volatility mechanism, but I don’t know much about it. From what I understand, it not only depends on the current rating of a player, but also on the recent result history of a player.
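
As a rough illustration of how the K factor scales rating changes, here is a sketch of a plain Elo update. The K values of 40 and 10 are only illustrative, and the 400-point scale is the chess convention; neither is taken from the EGF system.

```python
def elo_update(rating: float, opp_rating: float, score: float, k: float) -> float:
    """Return the new rating after one game.

    score is 1.0 for a win, 0.0 for a loss; k is the K factor.
    """
    expected = 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))
    return rating + k * (score - expected)


# Same game (a win against an equal opponent), different volatility.
print(elo_update(1500, 1500, 1.0, k=40))  # high K: large jump
print(elo_update(1500, 1500, 1.0, k=10))  # low K: small jump
```

The expected score is identical in both calls; only the size of the correction differs.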

What I mean is that if we look at a particular game, rating system A might think Black is 12k and White is 10k, while rating system B thinks Black is 11k and White is 15k.

Then if, say, Black wins, that is some kind of evidence that system B is better than system A.

To compare systems, mismatched games are the ones we should be looking for.
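
One common way to turn "a little evidence" into a number is to compare the log-likelihood each system assigns to the actual results. This is a sketch, not something from goratings; the probabilities below are made up to mirror the example above (system A favours White, system B favours Black).

```python
import math


def log_likelihood(black_win_probs, black_won):
    """Sum of log-probabilities a system assigned to what actually happened.

    black_win_probs: predicted P(Black wins) per game.
    black_won: 1 if Black won that game, else 0.
    Higher (closer to zero) is better.
    """
    total = 0.0
    for p, won in zip(black_win_probs, black_won):
        total += math.log(p if won else 1.0 - p)
    return total


# Hypothetical predictions for one game that Black won.
system_a = log_likelihood([0.35], [1])  # A thought White was stronger
system_b = log_likelihood([0.80], [1])  # B thought Black was stronger
print(system_a, system_b)  # B scores higher on this game

# If both systems say 50%, the game cannot separate them.
print(log_likelihood([0.5], [1]) == log_likelihood([0.5], [0]))
```

Games where both systems predict the same probability add the same amount to both totals, which is why only games where the predictions differ help to separate two systems.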

An Elo rating system is in itself already a sort of logistic regression system, continuously minimising apparent mismatches between its expectations and actual events.

If a rating system is pretty bad, I think you’ll already find mismatches between players with the same rating, where some of them are evidently much stronger than others (winrates far away from 50%).

A second-order way in which a rating system can be bad (though it may be less obvious) is when almost all equally rated players really do have a 50% winrate against others of the same rating (which looks good), but rating gaps are scaled wrongly, so that winrates between unevenly matched players are much less or much more skewed than the rating system predicts.
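
Both failure modes can be checked with a calibration table: bucket games by rating gap and compare the average predicted winrate with the observed one. The sketch below assumes a hypothetical input format of `(rating_gap, predicted_win_prob, won)` tuples, all from the same player's point of view.

```python
from collections import defaultdict


def calibration_by_gap(games, bucket_width=100.0):
    """Group games by rating gap and compare predicted vs observed winrate.

    games: iterable of (rating_gap, predicted_win_prob, won) tuples, where
    won is 1 or 0. Returns {bucket_lower_edge: (count, mean_predicted, observed)}.
    """
    buckets = defaultdict(lambda: [0, 0.0, 0.0])
    for gap, predicted, won in games:
        key = int(gap // bucket_width) * bucket_width  # lower edge of the bucket
        entry = buckets[key]
        entry[0] += 1
        entry[1] += predicted
        entry[2] += won
    return {
        key: (n, pred_sum / n, win_sum / n)
        for key, (n, pred_sum, win_sum) in sorted(buckets.items())
    }


# Tiny made-up sample, only to show the shape of the output.
sample = [(20, 0.53, 1), (50, 0.57, 0), (250, 0.80, 1), (270, 0.82, 1)]
for edge, (n, predicted, observed) in calibration_by_gap(sample).items():
    print(edge, n, round(predicted, 2), round(observed, 2))
```

A bucket near zero gap with an observed winrate far from 50% points at the first problem; large-gap buckets where observed and predicted winrates diverge point at the scaling problem.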

I’m pretty sure it does but in a very complicated way.

I mean, it has to, right?

It looks like you first compute the volatility, which is based on the results of the games and found by some iterative algorithm. That then gives a pre-period guess at the new deviation, which gets updated again with a quantity v that depends on all the games’ results.

@Feijoa If we look at, say, some of my recent games via the API:

ended game_id played_black handicap rating deviation volatility opponent_id opponent_rating opponent_deviation outcome extra annulled result
1714928783 61712295 1 0 1841.87 64.33 0.059997 1208231 1551.76 118.29 0 null 0 Resignation
1714854525 63990994 1 0 1860.62 64.01 0.059991 681030 2052.45 62.59 0 null 0 Timeout
1712514504 63187854 0 0 1866.62 63.98 0.059992 614879 1960.43 63.78 0 null 0 Resignation
1712090237 63041688 1 0 1875.45 64.15 0.059994 788871 1786.14 86.31 0 null 0 Timeout
1710353740 62048349 0 1 1890.08 64.27 0.059992 811318 1720.78 65.21 1 null 0 Resignation
1709891931 61712293 1 0 1881.16 64.45 0.059993 868221 1739.64 61.03 1 null 0 Resignation
1709834427 61712292 0 0 1873.67 64.56 0.059994 54729 1755.47 63.12 1 null 0 Resignation
1709681115 62205925 0 0 1865.44 64.73 0.059996 1348548 1366.91 62.24 1 null 0 64.5 points
1709679034 62205283 0 0 1864.10 64.12 0.059996 1348548 1368.17 61.57 1 null 0 143.5 points
1709656550 61712294 0 0 1862.76 63.50 0.059997 915719 1545.64 62.61 1 null 0 Resignation
1709603311 62178683 0 0 1859.45 63.16 0.059998 1524321 1372.31 95.82 1 null 0 63.5 points
1709601311 62178033 0 0 1858.03 62.53 0.059999 1285501 1684.37 61.53 0 null 0 Resignation
1709599180 62177342 1 0 1874.50 62.42 0.059994 1497548 1864.64 61.19 1 null 0 Resignation
1709576028 62168430 1 4 1863.45 62.55 0.059994 38364 2238.99 65.10 0 null 0 Resignation
1709557391 62161289 1 0 1876.32 62.65 0.059993 1400433 2032.38 63.05 0 null 0 Timeout
1709375297 62098060 0 0 1882.98 62.63 0.059995 241688 1797.27 61.23 0 null 0 Timeout
1708394769 61753656 0 0 1897.11 62.70 0.059993 118604 1599.08 63.77 1 null 0 9.5 points
1708388194 61751546 0 0 1893.57 62.37 0.059994 1472986 1960.82 63.03 1 null 0 Timeout
1708386532 61751021 0 0 1880.14 62.44 0.059993 1178493 1808.33 61.66 0 null 0 5.5 points
1708020212 61619403 0 0 1893.76 62.52 0.059991 937108 1711.37 63.08 1 null 0 Resignation
1707683923 61500668 1 0 1887.81 62.44 0.059993 1233684 1654.56 74.29 1 null 0 55.5 points
1706890633 59621153 0 0 1883.06 62.23 0.059994 778702 1883.06 65.71 0 null 0 Resignation
1706873695 59621169 0 0 1894.32 62.34 0.059994 782566 1722.35 64.65 1 null 0 Resignation
1706303329 61023149 0 0 1888.14 62.27 0.059996 742404 2077.83 67.40 0 null 0 Resignation
1705809041 59621162 1 0 1893.86 62.16 0.059997 499807 1917.47 65.89 1 null 0 Resignation
1705530480 60758055 1 0 1881.88 62.26 0.059997 703157 1917.91 66.92 1 null 0 0.5 points
1704579880 60433086 0 0 1869.49 62.35 0.059996 1041259 1804.74 62.78 1 null 0 1.5 points
1704475861 59621158 0 0 1860.22 62.45 0.059997 1039 1934.75 66.68 0 null 0 Resignation
1703411584 59621147 1 0 1869.20 62.55 0.059998 130250 1678.06 62.75 0 null 0 Timeout
1701371020 59363707 0 0 1886.07 62.40 0.059993 868221 1770.27 60.65 1 null 0 Resignation
1701111807 59276987 0 0 1878.33 62.44 0.059994 1312374 1903.62 61.93 1 null 0 Resignation
1700863379 59194012 1 0 1866.16 62.56 0.059993 831416 1837.59 61.65 1 null 0 Resignation
1697742346 58121505 1 0 1855.67 62.69 0.059994 1299675 2068.89 84.74 1 null 0 Resignation

There are some games where I lose to stronger players: about 15 games down the table, the deviation goes down slightly (62.5, then 62.4 or so). Then I lose to a lower-rated player and it starts going up (63, 64), and the other wins aren’t actually bringing it back down until maybe the third entry. But then I lose a few more times and it goes back up, particularly when I resigned today against a 5 kyu.

Amybot ddk, for example, is very different: the volatility is pretty much maxed out, I think. I don’t think it can go above 0.15, for instance.

ended game_id played_black handicap rating deviation volatility opponent_id opponent_rating opponent_deviation outcome extra annulled result
1714931495 64018627 1 0 904.47 100.98 0.149693 869277 1069.30 62.40 1 null 0 9.5 points
1714931246 64018433 1 0 860.44 100.76 0.149622 424460 1362.87 76.82 0 null 0 Resignation
1714931076 64018554 0 0 863.72 98.27 0.149632 1419172 1002.92 62.03 0 null 0 Resignation
1714930909 64018527 0 0 882.02 98.45 0.149654 1419172 994.99 62.00 0 null 0 Resignation
1714930851 64018488 0 0 902.40 98.85 0.149674 1419172 1009.55 61.89 1 null 0 8.5 points
1714930635 64018387 0 0 864.06 98.93 0.149629 1419172 1002.37 61.82 0 null 0 Resignation
1714930545 64018047 1 0 882.70 99.23 0.149651 1501209 632.24 80.95 0 null 0 32.5 points
1714930487 64018342 0 0 928.78 97.95 0.149549 1419172 993.02 61.83 0 null 0 Resignation
1714930365 64018293 0 0 952.81 98.51 0.149561 1419172 982.60 61.84 0 null 0 Resignation
1714930234 64018247 0 0 980.01 99.21 0.149566 1419172 970.97 61.85 0 null 0 Resignation
1714930146 64017934 1 0 1010.94 99.97 0.149558 1505560 889.87 62.08 1 null 0 16.5 points
1714930066 64018156 0 0 990.53 100.56 0.149579 1419172 958.62 61.84 0 null 0 Resignation
1714929957 64017988 0 5 1024.33 101.45 0.149564 1472623 747.19 61.59 1 null 0 25.5 points
1714929910 64018104 0 0 1001.24 102.45 0.149581 1419172 968.99 61.85 1 null 0 0.5 points
1714929632 64017650 1 0 971.83 103.83 0.149586 1329510 1150.89 62.39 0 null 0 Resignation
1714929518 64017948 0 0 989.27 104.50 0.149610 545062 903.54 63.19 1 null 0 Resignation
1714929518 64017952 0 9 963.63 106.08 0.149625 1553152 321.80 74.21 1 null 0 Resignation
1714929496 64017824 0 4 959.72 103.99 0.149636 1472623 757.58 61.58 1 null 0 35.5 points
1714929444 64017925 0 0 931.88 105.59 0.149647 1554022 511.66 88.47 1 null 0 50.5 points
1714929293 64017742 0 9 926.24 103.92 0.149661 1553152 330.92 74.82 1 null 0 77.5 points
1714929281 64017861 0 0 921.33 102.00 0.149674 1554022 515.97 88.82 1 null 0 42.5 points
1714929066 64017570 0 5 915.65 100.16 0.149689 1472623 748.06 61.58 0 null 0 Resignation
1714928965 64017723 0 0 945.34 101.12 0.149688 1554022 519.89 89.09 1 null 0 Resignation
1714928767 64017634 0 0 940.31 99.08 0.149702 1554022 524.01 89.41 1 null 0 Resignation
1714928449 64017370 0 4 935.26 96.93 0.149717 1472623 759.10 61.59 1 null 0 55.5 points
1714928390 64017339 0 7 909.24 97.37 0.149722 1498231 691.26 77.09 0 null 0 20.5 points
1714927934 64017108 0 6 935.78 97.79 0.149725 1472623 750.75 61.57 0 null 0 Resignation
1714927822 64017054 0 0 961.96 98.36 0.149731 1163766 653.56 66.48 1 null 0 64.5 points
1714927571 64017102 1 2 953.46 96.98 0.149752 451385 1173.36 63.78 0 null 0 Resignation
1714927375 64016755 1 0 973.09 97.13 0.149772 1505560 875.53 62.02 0 null 0 24.5 points
1714927331 64016900 0 0 1009.35 97.05 0.149731 1501171 835.32 62.03 1 null 0 17.5 points
1714927310 64016993 1 2 993.93 96.76 0.149755 1080566 1017.30 62.89 1 null 0 8.5 points
1714927045 64016883 1 2 970.87 97.10 0.149768 1080566 1028.11 62.99 1 null 0 56.5 points
1714926966 64016928 0 4 944.99 97.55 0.149774 546626 887.28 61.88 1 null 0 14.5 points
1714926943 64016810 1 2 907.55 97.42 0.149728 451385 1166.78 63.74 0 null 0 Resignation
1714926784 64016722 0 0 924.46 97.35 0.149751 1501171 844.23 62.06 1 null 0 46.5 points
1714926747 64016776 1 2 902.03 97.74 0.149766 1080566 1017.34 63.09 0 null 0 Resignation
1714926669 64016793 0 0 929.27 98.30 0.149769 418160 919.08 70.87 1 null 0 20.5 points
1714926591 64016794 1 0 900.74 98.90 0.149768 43708 1383.35 75.38 0 null 0 8.5 points
1714926421 64016636 0 0 904.25 96.37 0.149779 1118872 905.42 62.22 0 null 0 Resignation
1714926382 64016468 1 0 932.47 96.69 0.149775 1505560 861.99 62.00 0 null 0 48.5 points
1714926268 64016486 0 0 966.44 96.74 0.149746 1501171 852.12 62.05 1 null 0 10.5 points
1714926223 64016646 0 4 946.93 96.86 0.149766 546626 900.74 62.04 0 null 0 Resignation
1714926194 64016516 1 2 965.80 96.93 0.149787 451385 1182.84 63.69 1 null 0 23.5 points
1714926071 64016425 0 0 928.87 96.71 0.149741 1118872 916.51 62.28 1 null 0 25.5 points
1714926060 64016601 0 4 901.33 97.10 0.149740 546626 894.28 61.94 0 null 0 Resignation
1714925704 64002724 0 0 917.61 96.92 0.149764 1272533 1071.26 61.76 0 null 0 33.5 points
1714925595 64016185 0 0 934.36 96.79 0.149787 1501171 837.83 61.99 0 null 0 1.5 points
1714925524 64016190 1 2 970.28 96.66 0.149747 451385 1174.51 63.76 0 null 0 10.5 points
1714925142 64016206 0 0 990.96 96.85 0.149765 1419172 956.59 61.84 0 null 0 Resignation
1714925065 64016157 0 0 1022.32 97.13 0.149749 1419172 966.18 61.85 1 null 0 15.5 points

Probably it makes sense since the bot has crazy results all the time, beating 13kyus and then losing to 25kyus. It’s almost a definition of volatile :slight_smile:

If I get back to a computer, I should try to make some graphs plotting volatility, deviation and outcome against each other somehow, and see if the behaviour is consistent or intuitive in some way.

But it does look like, say, Amy’s deviation changes a lot more because of the high volatility.
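
As a starting point for that kind of plot, here is a small sketch that parses rows in the format pasted above. The column names come from the header line; everything else (function names, the choice to diff consecutive rows) is my own assumption. In the rows above the rating appears to be the value after the game, so the sketch pairs each game's outcome with the change in deviation relative to the previous game.

```python
COLUMNS = (
    "ended game_id played_black handicap rating deviation volatility "
    "opponent_id opponent_rating opponent_deviation outcome extra "
    "annulled result"
).split()
FLOAT_COLUMNS = {"rating", "deviation", "volatility",
                 "opponent_rating", "opponent_deviation"}
INT_COLUMNS = {"ended", "game_id", "played_black", "handicap",
               "opponent_id", "outcome", "annulled"}


def parse_row(line):
    """Parse one whitespace-separated row; 'result' may contain spaces."""
    fields = line.strip().split(maxsplit=len(COLUMNS) - 1)
    row = dict(zip(COLUMNS, fields))
    for name in FLOAT_COLUMNS:
        row[name] = float(row[name])
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    return row


def deviation_steps(rows):
    """Return (outcome, change in deviation) per game, oldest first."""
    ordered = sorted(rows, key=lambda r: r["ended"])
    return [
        (later["outcome"], later["deviation"] - earlier["deviation"])
        for earlier, later in zip(ordered, ordered[1:])
    ]


rows = [parse_row(line) for line in [
    "1714928783 61712295 1 0 1841.87 64.33 0.059997 1208231 1551.76 118.29 0 null 0 Resignation",
    "1714854525 63990994 1 0 1860.62 64.01 0.059991 681030 2052.45 62.59 0 null 0 Timeout",
]]
print(deviation_steps(rows))
```

The list of `(outcome, change)` pairs can then be fed to whatever plotting tool is at hand.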

I remember someone saying before that it doesn’t, and the calculator seems to agree:

The updated deviation is independent of results. I agree that it doesn’t sound very reasonable, but if I remember correctly it has something to do with how we use a period consisting of only a single game.
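
For reference, here is a sketch of the rating and deviation steps of Glicko-2 for a period containing a single game, following the formulas in Glickman's Glicko-2 description. It deliberately skips the iterative volatility step and reuses the old volatility, which is a simplification of mine; the 1500 centre and 173.7178 scale are the paper's defaults, and I am assuming rather than asserting that the site uses the same constants.

```python
import math

SCALE = 173.7178  # Glicko-2 factor between rating points and the internal scale


def g(phi):
    """Down-weights a game according to the opponent's deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def single_game_update(rating, rd, sigma, opp_rating, opp_rd, score):
    """Return (new_rating, new_rd) for a one-game period, volatility held fixed."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
    expected = 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))
    # v depends on the ratings and deviations, not on who won.
    v = 1.0 / (g(phi_j) ** 2 * expected * (1.0 - expected))
    phi_star = math.sqrt(phi ** 2 + sigma ** 2)  # pre-period deviation
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    # Only the rating step uses the actual score.
    new_mu = mu + new_phi ** 2 * g(phi_j) * (score - expected)
    return 1500.0 + SCALE * new_mu, SCALE * new_phi


win = single_game_update(1860.0, 64.0, 0.06, 1550.0, 118.0, 1.0)
loss = single_game_update(1860.0, 64.0, 0.06, 1550.0, 118.0, 0.0)
print(win, loss)  # different ratings, same deviation
```

In this simplified form the score only enters the rating step. In the full algorithm the new volatility does depend on the result, so the deviation can still be affected indirectly through the volatility term.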

It does seem very odd. If it turns out the calculator is wrong, it’s probably my fault from trying to turn the Python code into JavaScript :slight_smile: Definitely some things are off, like the win rate, which I don’t think is shown anywhere.

The deviation never goes very low because the period never includes more than one game. Using a time-based period would cause a player who plays more frequently to have lower deviation (all other things being equal).
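
A rough way to see this effect is to iterate the Glicko-2 deviation update until it settles, once with one game per period and once with several. The sketch below makes strong simplifying assumptions that are mine, not the site's: volatility is held fixed at 0.06, every opponent has the same rating as the player (so the expected score is 0.5), and every opponent has a deviation of 62.

```python
import math

SCALE = 173.7178  # Glicko-2 factor between rating points and the internal scale


def steady_state_rd(games_per_period, sigma=0.06, opp_rd=62.0,
                    start_rd=350.0, periods=2000):
    """Deviation a player converges to when playing a fixed number of
    games per rating period against equally rated opponents."""
    phi_j = opp_rd / SCALE
    g = 1.0 / math.sqrt(1.0 + 3.0 * phi_j ** 2 / math.pi ** 2)
    expected = 0.5  # equally rated opponents
    info_per_game = g ** 2 * expected * (1.0 - expected)  # 1/v for one game
    phi = start_rd / SCALE
    for _ in range(periods):
        # Volatility inflates the variance once per period...
        phi_star_sq = phi ** 2 + sigma ** 2
        # ...and every game in the period shrinks it again.
        phi = 1.0 / math.sqrt(1.0 / phi_star_sq + games_per_period * info_per_game)
    return SCALE * phi


print(round(steady_state_rd(1), 1))  # one game per period
print(round(steady_state_rd(5), 1))  # five games per period
```

Under these assumptions the one-game-per-period floor lands at roughly 60, close to the deviations in the tables above, while five games per period settles at roughly 40. The volatility inflation is applied once per period, so packing more games into a period lets the deviation fall further.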

The Glicko-2 formulas are beyond my math abilities, but I have a fuzzy idea of the notation, so this may be naive.

Is it possible to use some hybrid Glicko where the rating update is based on single game periods, but the rating deviation and volatility updates are based on periods?

The idea would be to preserve the expected behavior that a win makes the rating go up (and a loss makes it go down), while also making the rating deviation and volatility somewhat more useful for the median active user. I.e., set the period to a length in which the median active user plays at least 5 games.

Done. The document is checked into the main branch of goratings and is called RatingsV6.md.

Beware:

  • I rushed this together fairly quickly and didn’t edit extensively, and @anoek graciously let me land it pretty much as-is even though it probably has lots of confusing bits and (surely!) some outright errors.
    • BTW, pull requests are welcome!
  • I didn’t capture everyone’s ideas. With limited time, I focused on the ones that I’ve been thinking the most about myself, especially where I’d thought through some plausibly useful math that I was afraid I’d forget.
    • If your idea isn’t in there, it hasn’t been rejected. I just didn’t write it down (yet).
    • BTW, pull requests are welcome! (Although see urgent stuff at the bottom for the collaboration I think is most useful right now…)
  • There’s a bunch of MathJax / LaTeX / whatever… those bits are much easier to read in the browser than in the file when you’re editing. (I think it’s the right tradeoff? Math is hard in text files. Another option would be pseudocode or Python code or something, but that’s harder to compare to the Glicko-2 paper… that’s why I stuck to math…)

The high-level summary is:

  • Improve the metrics
    • Review the current metrics, which are mostly targeting predictive performance, and remove the ones that we don’t find useful to reduce noise.
    • Expand the metrics for things we want to track but aren’t currently tracking (e.g., performance when rank diff is bigger than 1; historic volatility of ratings; etc.)
    • Show more cross-sectional views of the metrics (by rank? by board size? by game speed? human or bot?)
  • Develop a script that models the specific ratings categories (e.g., blitz-9x9) as the primary drivers, and tune how the ratings categories interact to improve their performance
    • I expect that with some effort we can make this much better than the status quo (where “overall” is the primary driver)
  • Model fixed-length (but start time variable per-user) time periods
    • Ideally, make ratings relatively stable for players who play relatively frequently and whose playing strength is relatively consistent, while still allowing quick changes when a player’s strength changes
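
To illustrate one possible reading of "fixed-length (but start time variable per-user)" periods, here is a minimal sketch that anchors each user's periods at their own first game. The function name and the anchoring rule are hypothetical; the actual design in RatingsV6.md may differ.

```python
from collections import defaultdict


def group_into_periods(game_end_times, period_seconds):
    """Group one user's games into fixed-length periods.

    Period 0 starts at the user's first game, so different users get
    different period boundaries. Returns {period_index: [end_times]}.
    """
    if not game_end_times:
        return {}
    times = sorted(game_end_times)
    start = times[0]
    periods = defaultdict(list)
    for t in times:
        periods[(t - start) // period_seconds].append(t)
    return dict(periods)


WEEK = 7 * 24 * 60 * 60
games = [1714925065, 1714925142, 1714925065 + WEEK + 5]
print(group_into_periods(games, WEEK))
```

Each period's games would then be processed together in a single rating update, and periods with no games simply do not appear in the result.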

I’m out (no coding, just whatever I can do easily from mobile) for a few months. The document currently answers the question, “What would @dexonsmith do in the goratings repo if he had some dedicated time for experiments and no input from others?”

… but I would be delighted if others dove in and moved forward before I’m “back”, even if that means the direction evolves significantly from what I’ve envisioned. Please keep the document updated on the state of experiments, fix it where appropriate, and reorganize (or even delete) sections that are not useful (we have Git history to recover lost gold). If you include me on pull requests I’ll do my best to give feedback, and include @anoek for his comments/feedback as well.

Urgent stuff: (if you’ll allow me to call something urgent that I won’t be making time for myself until September-ish…):

  • Improve the metrics so that we’re measuring what we care about. (and/or adding/improving visualizations so it’s easier to understand the results)
  • Model specific-ratings-categories-as-primary in such a way that we use the metrics to evaluate it (in the document, I call this “cohesive-ratings-grid” or something).

If you have time to contribute, that’s what I’d most appreciate help with! (Once those things are done, it’ll be more interesting to consider ideas for how to make ratings better, since we’ll be able to evaluate them.)

Sorry I missed this comment before.

I’m not sure that your specific idea can work well—IMO we should update them together—but it’s certainly possible to preserve the win-means-rating-goes-up property when introducing time periods. The only thing that breaks that property is rolling periods, where you have a look-back window (of games, or of time) that includes different but overlapping subsets of games in each evaluation.

Note that my current idea for what to experiment with, documented in RatingsV6 linked above, doesn’t include rolling periods, so the win-means-rating-goes-up property would be preserved.

Also, note that @anoek is against breaking the win-means-rating-goes-up property (given past experience) unless we have really compelling data that it’s giving us some important win we can’t achieve another way. I’m currently optimistic we won’t need to break it.

I just got done reading the doc; it looks awesome!
