A couple of questions about handicap and rating

Feijoa · May 5, 2024, 3:46pm

You’re still attempting to use what seems to me to be the wrong goal - predicting the result of evenly-matched games. Yes, you might as well flip a coin for those.

The only way one rating system could be better than another is if their predictions differ - one of them might predict that White has the advantage while the other thinks Black does. Then depending on which way the game goes, we gain a little evidence about which system might be better.

gennan · May 5, 2024, 4:07pm

I think that when the rating gap is the same, predictions would be the same, at least if the rating system is some sort of Elo rating system. When the rating gap between opponents gets larger, the win prediction would be skewed further away from 50%.
But Elo-based rating systems can still differ in how much each player’s rating changes when a game result is processed, which affects predictions for their future games.

In a regular Elo rating system, as used in chess, the rating change sizes are scaled by the K factor.
A large K factor results in more rating volatility and a low K factor results in less rating volatility.
Chess rating systems usually use a high K factor (more volatile ratings) for lower rated players and a low K factor (less volatile ratings) for higher rated players. The EGF rating system has a similar scale factor called con, which depends on the current rating of a player (see “EGF ratings system” on E.G.D., the European Go Database).
I think that Glicko-2 has some similar rating volatility mechanism, but I don’t know much about it. From what I understand, it not only depends on the current rating of a player, but also on the recent result history of a player.
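To make the two Elo points above concrete, here is a minimal sketch of the textbook Elo formulas. The K values 32 and 16 are illustrative assumptions only, and this does not model the EGF con factor or Glicko-2.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Textbook Elo win expectation; it depends only on the rating gap."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def elo_update(rating: float, opponent_rating: float, score: float, k: float) -> float:
    """Return the new rating after one game (score: 1 = win, 0 = loss)."""
    return rating + k * (score - expected_score(rating, opponent_rating))


# The same 100-point gap gives the same prediction at any rating level.
p_low = expected_score(1100, 1000)
p_high = expected_score(2100, 2000)

# The K factor (hypothetical values here) only scales how far one result
# moves the rating: a higher K means a more volatile rating.
gain_volatile = elo_update(1500, 1500, 1.0, k=32) - 1500
gain_stable = elo_update(1500, 1500, 1.0, k=16) - 1500
```

Two systems built on this formula agree on every prediction for a given gap; they only drift apart over time because different K values move the ratings by different amounts.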

Feijoa · May 5, 2024, 4:17pm

What I mean is that if we look at a particular game, rating system A might think Black is 12k and White is 10k, while rating system B thinks Black is 11k and White is 15k.

Then if, say, Black wins, that is some kind of evidence that system B is better than system A.

To compare systems, mismatched games are the ones we should be looking for.
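One way to turn "some kind of evidence" into a number is to compare the log-likelihood each system assigns to the result that actually happened. This is only a sketch: the win probabilities below are made-up stand-ins for whatever systems A and B would predict, not values from any real rating system.

```python
import math


def log_score(p_black_wins: float, black_won: bool) -> float:
    """Log-likelihood of the observed result under one system's prediction."""
    p = p_black_wins if black_won else 1.0 - p_black_wins
    return math.log(p)


def evidence_for_b(games) -> float:
    """Sum of log-likelihood differences (B minus A) over a list of games.

    Each game is (p_black_wins_under_A, p_black_wins_under_B, black_won).
    A positive total favours system B, a negative total favours system A.
    """
    return sum(log_score(p_b, won) - log_score(p_a, won) for p_a, p_b, won in games)


# Hypothetical game where the systems disagree: A favours White, B favours Black.
disputed = [(0.35, 0.80, True)]
# Hypothetical game both systems call even: the result cannot separate them.
even = [(0.50, 0.50, True)]

gain_disputed = evidence_for_b(disputed)
gain_even = evidence_for_b(even)
```

Games where both systems give the same probability contribute exactly zero, which is the reason to look for games where the predictions differ.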

gennan · May 5, 2024, 4:30pm

An Elo rating system is in itself already a sort of logistic regression system, continuously minimising apparent mismatches between its expectations and actual events.

If a rating system is pretty bad, I think you’ll already find mismatches between players with the same rating, where some of them are evidently much stronger than others (winrates far away from 50%).

A second-order (and possibly less obvious) way in which a rating system can be bad is when almost all equally rated players actually have a 50% winrate against others of the same rating (looking good), but rating gaps are scaled wrong, so that winrates between unevenly matched players are much less or much more skewed than the rating system predicts.
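A calibration table by rating gap would expose that second kind of problem. The sketch below uses synthetic games (not real data) whose win counts follow an Elo curve with a scale of 400, then checks them against a hypothetical system that predicts with a scale of 200. The function names and bucket width are assumptions for illustration.

```python
from collections import defaultdict


def predicted_win(gap: float, scale: float) -> float:
    """Logistic (Elo-style) win probability for the higher-rated player."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


def calibration(games, scale: float, bucket: int = 100):
    """Map bucket start -> (mean predicted winrate, observed winrate, game count).

    games is an iterable of (rating_gap, won) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])
    for gap, won in games:
        entry = sums[int(gap // bucket) * bucket]
        entry[0] += predicted_win(gap, scale)
        entry[1] += int(won)
        entry[2] += 1
    return {b: (p / n, w / n, n) for b, (p, w, n) in sorted(sums.items())}


# Synthetic data: 100 games per gap, with win counts taken from a scale-400 curve.
games = []
for gap in (0, 200, 400):
    wins = round(100 * predicted_win(gap, 400.0))
    games += [(gap, True)] * wins + [(gap, False)] * (100 - wins)

# The system under test is assumed to scale rating gaps wrongly (200, not 400).
table = calibration(games, scale=200.0)
```

In this toy setup the even-game bucket looks perfectly calibrated, while the buckets with larger gaps show predictions that are more skewed than the observed winrates.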

shinuito · May 5, 2024, 5:41pm

I’m pretty sure it does but in a very complicated way.

I mean, it has to, right?

It looks like you compute the volatility, which is based on the results of the games, using some iterative algorithm. That then gives a pre-period guess at the new deviation, which gets updated again with a quantity v that depends on all the games’ results.
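For anyone who wants to poke at this directly, here is a compact sketch of one Glicko-2 rating-period update, written from my reading of Glickman's published description. It is not the OGS or goratings code, and tau=0.5 is just the value used in the paper's worked example. In these formulas the scores enter through delta (and so through the volatility), while v is built from the opponents' ratings, deviations and the expected scores.

```python
import math

SCALE = 173.7178  # conversion between rating points and Glicko-2's internal scale


def _g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def glicko2_update(rating, rd, vol, games, tau=0.5, eps=1e-6):
    """games: list of (opp_rating, opp_rd, score). Returns (rating, rd, vol)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    v_inv, score_sum = 0.0, 0.0
    for opp_rating, opp_rd, score in games:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        e = 1.0 / (1.0 + math.exp(-_g(phi_j) * (mu - mu_j)))
        v_inv += _g(phi_j) ** 2 * e * (1.0 - e)   # no score in this term
        score_sum += _g(phi_j) * (score - e)      # results enter here
    v = 1.0 / v_inv
    delta = v * score_sum

    # New volatility: solve f(x) = 0 iteratively (Illinois variant of regula falsi).
    a = math.log(vol ** 2)

    def f(x):
        ex = math.exp(x)
        return (ex * (delta ** 2 - phi ** 2 - v - ex)
                / (2.0 * (phi ** 2 + v + ex) ** 2)) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_vol = math.exp(A / 2.0)

    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)               # pre-period deviation
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)    # updated deviation
    new_mu = mu + new_phi ** 2 * score_sum
    return 1500.0 + SCALE * new_mu, SCALE * new_phi, new_vol


# One-game "period" with made-up inputs close to a row in the table below:
# same player, same opponent, win versus loss.
win = glicko2_update(1841.87, 64.33, 0.06, [(1551.76, 118.29, 1.0)])
loss = glicko2_update(1841.87, 64.33, 0.06, [(1551.76, 118.29, 0.0)])
```

Comparing `win[1]` with `loss[1]` shows how much of the deviation update comes from the result itself: the only path from the score to the new deviation in this sketch is through the new volatility.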

shinuito · May 5, 2024, 6:33pm

@Feijoa If we look at, say, some of my recent games from the API:

ended	game_id	played_black	handicap	rating	deviation	volatility	opponent_id	opponent_rating	opponent_deviation	outcome	extra	result
1714928783	61712295	1	0	1841.87	64.33	0.059997	1208231	1551.76	118.29	0	null	Resignation
1714854525	63990994	1	0	1860.62	64.01	0.059991	681030	2052.45	62.59	0	null	Timeout
1712514504	63187854	0	0	1866.62	63.98	0.059992	614879	1960.43	63.78	0	null	Resignation
1712090237	63041688	1	0	1875.45	64.15	0.059994	788871	1786.14	86.31	0	null	Timeout
1710353740	62048349	0	1	1890.08	64.27	0.059992	811318	1720.78	65.21	1	null	Resignation
1709891931	61712293	1	0	1881.16	64.45	0.059993	868221	1739.64	61.03	1	null	Resignation
1709834427	61712292	0	0	1873.67	64.56	0.059994	54729	1755.47	63.12	1	null	Resignation
1709681115	62205925	0	0	1865.44	64.73	0.059996	1348548	1366.91	62.24	1	null	64.5 points
1709679034	62205283	0	0	1864.10	64.12	0.059996	1348548	1368.17	61.57	1	null	143.5 points
1709656550	61712294	0	0	1862.76	63.50	0.059997	915719	1545.64	62.61	1	null	Resignation
1709603311	62178683	0	0	1859.45	63.16	0.059998	1524321	1372.31	95.82	1	null	63.5 points
1709601311	62178033	0	0	1858.03	62.53	0.059999	1285501	1684.37	61.53	0	null	Resignation
1709599180	62177342	1	0	1874.50	62.42	0.059994	1497548	1864.64	61.19	1	null	Resignation
1709576028	62168430	1	4	1863.45	62.55	0.059994	38364	2238.99	65.10	0	null	Resignation
1709557391	62161289	1	0	1876.32	62.65	0.059993	1400433	2032.38	63.05	0	null	Timeout
1709375297	62098060	0	0	1882.98	62.63	0.059995	241688	1797.27	61.23	0	null	Timeout
1708394769	61753656	0	0	1897.11	62.70	0.059993	118604	1599.08	63.77	1	null	9.5 points
1708388194	61751546	0	0	1893.57	62.37	0.059994	1472986	1960.82	63.03	1	null	Timeout
1708386532	61751021	0	0	1880.14	62.44	0.059993	1178493	1808.33	61.66	0	null	5.5 points
1708020212	61619403	0	0	1893.76	62.52	0.059991	937108	1711.37	63.08	1	null	Resignation
1707683923	61500668	1	0	1887.81	62.44	0.059993	1233684	1654.56	74.29	1	null	55.5 points
1706890633	59621153	0	0	1883.06	62.23	0.059994	778702	1883.06	65.71	0	null	Resignation
1706873695	59621169	0	0	1894.32	62.34	0.059994	782566	1722.35	64.65	1	null	Resignation
1706303329	61023149	0	0	1888.14	62.27	0.059996	742404	2077.83	67.40	0	null	Resignation
1705809041	59621162	1	0	1893.86	62.16	0.059997	499807	1917.47	65.89	1	null	Resignation
1705530480	60758055	1	0	1881.88	62.26	0.059997	703157	1917.91	66.92	1	null	0.5 points
1704579880	60433086	0	0	1869.49	62.35	0.059996	1041259	1804.74	62.78	1	null	1.5 points
1704475861	59621158	0	0	1860.22	62.45	0.059997	1039	1934.75	66.68	0	null	Resignation
1703411584	59621147	1	0	1869.20	62.55	0.059998	130250	1678.06	62.75	0	null	Timeout
1701371020	59363707	0	0	1886.07	62.40	0.059993	868221	1770.27	60.65	1	null	Resignation
1701111807	59276987	0	0	1878.33	62.44	0.059994	1312374	1903.62	61.93	1	null	Resignation
1700863379	59194012	1	0	1866.16	62.56	0.059993	831416	1837.59	61.65	1	null	Resignation
1697742346	58121505	1	0	1855.67	62.69	0.059994	1299675	2068.89	84.74	1	null	Resignation

There are some games where I lose to stronger players. About 15 games down the table the deviation goes down slightly (62.5, 62.4 or so); then I lose to a lower rated player and it starts going up (63, 64), and the other wins aren’t actually bringing it back down until maybe the third entry. Then I lose again a few times and it goes back up, particularly when I resigned today against a 5kyu.

Amybot ddk, for example, is very different: the volatility is pretty much maxed out, I think. I don’t think it can go above 0.15, for instance.

ended	game_id	played_black	handicap	rating	deviation	volatility	opponent_id	opponent_rating	opponent_deviation	outcome	extra	result
1714931495	64018627	1	0	904.47	100.98	0.149693	869277	1069.30	62.40	1	null	9.5 points
1714931246	64018433	1	0	860.44	100.76	0.149622	424460	1362.87	76.82	0	null	Resignation
1714931076	64018554	0	0	863.72	98.27	0.149632	1419172	1002.92	62.03	0	null	Resignation
1714930909	64018527	0	0	882.02	98.45	0.149654	1419172	994.99	62.00	0	null	Resignation
1714930851	64018488	0	0	902.40	98.85	0.149674	1419172	1009.55	61.89	1	null	8.5 points
1714930635	64018387	0	0	864.06	98.93	0.149629	1419172	1002.37	61.82	0	null	Resignation
1714930545	64018047	1	0	882.70	99.23	0.149651	1501209	632.24	80.95	0	null	32.5 points
1714930487	64018342	0	0	928.78	97.95	0.149549	1419172	993.02	61.83	0	null	Resignation
1714930365	64018293	0	0	952.81	98.51	0.149561	1419172	982.60	61.84	0	null	Resignation
1714930234	64018247	0	0	980.01	99.21	0.149566	1419172	970.97	61.85	0	null	Resignation
1714930146	64017934	1	0	1010.94	99.97	0.149558	1505560	889.87	62.08	1	null	16.5 points
1714930066	64018156	0	0	990.53	100.56	0.149579	1419172	958.62	61.84	0	null	Resignation
1714929957	64017988	0	5	1024.33	101.45	0.149564	1472623	747.19	61.59	1	null	25.5 points
1714929910	64018104	0	0	1001.24	102.45	0.149581	1419172	968.99	61.85	1	null	0.5 points
1714929632	64017650	1	0	971.83	103.83	0.149586	1329510	1150.89	62.39	0	null	Resignation
1714929518	64017948	0	0	989.27	104.50	0.149610	545062	903.54	63.19	1	null	Resignation
1714929518	64017952	0	9	963.63	106.08	0.149625	1553152	321.80	74.21	1	null	Resignation
1714929496	64017824	0	4	959.72	103.99	0.149636	1472623	757.58	61.58	1	null	35.5 points
1714929444	64017925	0	0	931.88	105.59	0.149647	1554022	511.66	88.47	1	null	50.5 points
1714929293	64017742	0	9	926.24	103.92	0.149661	1553152	330.92	74.82	1	null	77.5 points
1714929281	64017861	0	0	921.33	102.00	0.149674	1554022	515.97	88.82	1	null	42.5 points
1714929066	64017570	0	5	915.65	100.16	0.149689	1472623	748.06	61.58	0	null	Resignation
1714928965	64017723	0	0	945.34	101.12	0.149688	1554022	519.89	89.09	1	null	Resignation
1714928767	64017634	0	0	940.31	99.08	0.149702	1554022	524.01	89.41	1	null	Resignation
1714928449	64017370	0	4	935.26	96.93	0.149717	1472623	759.10	61.59	1	null	55.5 points
1714928390	64017339	0	7	909.24	97.37	0.149722	1498231	691.26	77.09	0	null	20.5 points
1714927934	64017108	0	6	935.78	97.79	0.149725	1472623	750.75	61.57	0	null	Resignation
1714927822	64017054	0	0	961.96	98.36	0.149731	1163766	653.56	66.48	1	null	64.5 points
1714927571	64017102	1	2	953.46	96.98	0.149752	451385	1173.36	63.78	0	null	Resignation
1714927375	64016755	1	0	973.09	97.13	0.149772	1505560	875.53	62.02	0	null	24.5 points
1714927331	64016900	0	0	1009.35	97.05	0.149731	1501171	835.32	62.03	1	null	17.5 points
1714927310	64016993	1	2	993.93	96.76	0.149755	1080566	1017.30	62.89	1	null	8.5 points
1714927045	64016883	1	2	970.87	97.10	0.149768	1080566	1028.11	62.99	1	null	56.5 points
1714926966	64016928	0	4	944.99	97.55	0.149774	546626	887.28	61.88	1	null	14.5 points
1714926943	64016810	1	2	907.55	97.42	0.149728	451385	1166.78	63.74	0	null	Resignation
1714926784	64016722	0	0	924.46	97.35	0.149751	1501171	844.23	62.06	1	null	46.5 points
1714926747	64016776	1	2	902.03	97.74	0.149766	1080566	1017.34	63.09	0	null	Resignation
1714926669	64016793	0	0	929.27	98.30	0.149769	418160	919.08	70.87	1	null	20.5 points
1714926591	64016794	1	0	900.74	98.90	0.149768	43708	1383.35	75.38	0	null	8.5 points
1714926421	64016636	0	0	904.25	96.37	0.149779	1118872	905.42	62.22	0	null	Resignation
1714926382	64016468	1	0	932.47	96.69	0.149775	1505560	861.99	62.00	0	null	48.5 points
1714926268	64016486	0	0	966.44	96.74	0.149746	1501171	852.12	62.05	1	null	10.5 points
1714926223	64016646	0	4	946.93	96.86	0.149766	546626	900.74	62.04	0	null	Resignation
1714926194	64016516	1	2	965.80	96.93	0.149787	451385	1182.84	63.69	1	null	23.5 points
1714926071	64016425	0	0	928.87	96.71	0.149741	1118872	916.51	62.28	1	null	25.5 points
1714926060	64016601	0	4	901.33	97.10	0.149740	546626	894.28	61.94	0	null	Resignation
1714925704	64002724	0	0	917.61	96.92	0.149764	1272533	1071.26	61.76	0	null	33.5 points
1714925595	64016185	0	0	934.36	96.79	0.149787	1501171	837.83	61.99	0	null	1.5 points
1714925524	64016190	1	2	970.28	96.66	0.149747	451385	1174.51	63.76	0	null	10.5 points
1714925142	64016206	0	0	990.96	96.85	0.149765	1419172	956.59	61.84	0	null	Resignation
1714925065	64016157	0	0	1022.32	97.13	0.149749	1419172	966.18	61.85	1	null	15.5 points

Probably it makes sense, since the bot has crazy results all the time, beating 13kyus and then losing to 25kyus. It’s almost the definition of volatile.

If I get back to a computer I should try to make some graphs or something to plot volatility, deviation and outcome against each other somehow, and see if the behaviour is consistent or intuitive in some way.

But it does look like, say, Amy’s deviation changes a lot more because of the high volatility.
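A first pass before any plotting could be to line up the per-game changes in rating and deviation against the outcome. The sketch below uses six rows copied from the first table above (with tabs replaced by spaces). It assumes that outcome 1 means a win for the listed player and that the rating and deviation columns are post-game values, which is what the rating movements in those rows suggest.

```python
COLUMNS = ("ended game_id played_black handicap rating deviation volatility "
           "opponent_id opponent_rating opponent_deviation outcome extra result").split()

# Rows copied from the first table in this thread, newest first.
RAW = """\
1714928783 61712295 1 0 1841.87 64.33 0.059997 1208231 1551.76 118.29 0 null Resignation
1714854525 63990994 1 0 1860.62 64.01 0.059991 681030 2052.45 62.59 0 null Timeout
1712514504 63187854 0 0 1866.62 63.98 0.059992 614879 1960.43 63.78 0 null Resignation
1712090237 63041688 1 0 1875.45 64.15 0.059994 788871 1786.14 86.31 0 null Timeout
1710353740 62048349 0 1 1890.08 64.27 0.059992 811318 1720.78 65.21 1 null Resignation
1709891931 61712293 1 0 1881.16 64.45 0.059993 868221 1739.64 61.03 1 null Resignation
"""


def parse(raw):
    """Parse whitespace-separated rows and return them oldest first."""
    rows = []
    for line in raw.strip().splitlines():
        # Limit the split so a result like "64.5 points" stays in one field.
        fields = line.split(None, len(COLUMNS) - 1)
        rows.append(dict(zip(COLUMNS, fields)))
    return sorted(rows, key=lambda row: int(row["ended"]))


def changes(rows):
    """Per-game change in rating and deviation, paired with that game's outcome."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        out.append({
            "won": cur["outcome"] == "1",  # assumption: 1 = win for this player
            "d_rating": round(float(cur["rating"]) - float(prev["rating"]), 2),
            "d_deviation": round(float(cur["deviation"]) - float(prev["deviation"]), 2),
            "volatility": float(cur["volatility"]),
        })
    return out


per_game = changes(parse(RAW))
for row in per_game:
    print(row)
```

The same list of dictionaries could be fed to a plotting library later; the point here is only to get deviation change, volatility and outcome side by side in time order.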

Feijoa · May 5, 2024, 6:35pm

I remember someone saying before that it doesn’t, and the calculator seems to agree:

The updated deviation is independent of results. I agree that it doesn’t sound very reasonable, but if I remember correctly it has something to do with how we use a period consisting of only a single game.

shinuito · May 5, 2024, 6:41pm

It does seem very odd. If it turns out the calculator is wrong, it’s probably my fault from trying to get the Python code turned into JavaScript. Definitely some things are off, like the win rate, which I don’t think is shown anywhere.

dexonsmith · May 5, 2024, 7:05pm

The deviation never goes very low because the period never includes more than one game. Using a time-based period would cause a player who plays more frequently to have lower deviation (all other things being equal).
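A rough way to see this effect is to iterate only the Glicko-2 deviation update until it settles, once with one game per period and once with several. This is a simplified sketch, not the OGS implementation: the volatility is held fixed at 0.06, every opponent is assumed to be equally rated (expected score 0.5) with a deviation of 62, and all of those numbers are assumptions chosen for illustration.

```python
import math

SCALE = 173.7178  # conversion between rating points and Glicko-2's internal scale


def g(phi: float) -> float:
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def rd_floor(games_per_period: int, opp_rd: float = 62.0, vol: float = 0.06,
             start_rd: float = 350.0, periods: int = 500) -> float:
    """Deviation reached after many periods of evenly matched games."""
    phi = start_rd / SCALE
    phi_j = opp_rd / SCALE
    # With equal ratings every expected score is 0.5, so each game adds
    # g(phi_j)^2 * 0.5 * 0.5 to 1/v.
    v = 1.0 / (games_per_period * g(phi_j) ** 2 * 0.25)
    for _ in range(periods):
        phi_star_sq = phi ** 2 + vol ** 2          # deviation grows before the period
        phi = 1.0 / math.sqrt(1.0 / phi_star_sq + 1.0 / v)  # then shrinks with games
    return SCALE * phi


floor_single_game = rd_floor(1)
floor_five_games = rd_floor(5)
```

Under these assumptions, more games per period give a smaller v and therefore a lower resting deviation, which is the mechanism described above.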

jgk · May 6, 2024, 7:49am

The Glicko 2 formulas are beyond my math abilities, but I have a fuzzy idea of the notation so this may be naive.

Is it possible to use some hybrid Glicko where the rating update is based on single game periods, but the rating deviation and volatility updates are based on periods?

The idea would be to preserve the expected behavior that a win makes the rating go up (and a loss makes it go down), but also make the rating deviations and volatility somewhat more useful for the median active user. I.e., set the period to a length where the median active user plays at least 5 games.

dexonsmith · May 24, 2024, 10:49pm

Done. The document is checked into the main branch of goratings and is called RatingsV6.md.

Beware:

- I rushed this together fairly quickly and didn’t edit extensively, and @anoek graciously let me land it pretty much as-is even though it probably has lots of confusing bits and (surely!) some outright errors.
  - BTW, pull requests are welcome!
- I didn’t capture everyone’s ideas. With limited time, I focused on the ones that I’ve been thinking the most about myself, especially where I’d thought through some plausibly useful math that I was afraid I’d forget.
  - If your idea isn’t in there, it hasn’t been rejected. I just didn’t write it down (yet).
  - BTW, pull requests are welcome! (Although see urgent stuff at the bottom for the collaboration I think is most useful right now…)
- There’s a bunch of MathJax / LaTeX / whatever… those bits are much easier to read in the browser than in the file when you’re editing. (I think it’s the right tradeoff? Math is hard in text files. Another option would be pseudo code or python code or something, but that’s harder to compare to the Glicko-2 paper… that’s why I stuck to math…)

The high-level summary is:

- Improve the metrics
  - Review the current metrics, which are mostly targeting predictive performance, and remove the ones that we don’t find useful to reduce noise.
  - Expand the metrics for things we want to track but aren’t currently tracking (e.g., performance when rank diff is bigger than 1; historic volatility of ratings; etc.)
  - Show more cross-sectional views of the metrics (by rank? by board size? by game speed? human or bot?)
- Develop a script that models the specific ratings categories (e.g., blitz-9x9) as the primary drivers, and tune how the ratings categories interact to improve their performance
  - I expect that with some effort we can make this much better than the status quo (where “overall” is the primary driver)
- Model fixed-length (but start time variable per-user) time periods
  - Ideally, make ratings relatively stable for players who play relatively frequently and whose playing strength is relatively consistent, while still allowing quick changes when a player’s strength changes

I’m out (no coding, just whatever I can do easily from mobile) for a few months. The document currently answers the question, “What would @dexonsmith do in the goratings repo if he had some dedicated time for experiments and no input from others?”

… but I would be delighted if others dove in and moved forward before I’m “back”, even if that means the direction evolves significantly from what I’ve envisioned. Please keep the document updated on the state of experiments, fix it where appropriate, and reorganize (or even delete) sections that are not useful (we have Git history to recover lost gold). If you include me on pull requests I’ll do my best to give feedback, and include @anoek for his comments/feedback as well.

Urgent stuff: (if you’ll allow me to call something urgent that I won’t be making time for myself until September-ish…):

- Improve the metrics so that we’re measuring what we care about. (and/or adding/improving visualizations so it’s easier to understand the results)
- Model specific-ratings-categories-as-primary in such a way that we use the metrics to evaluate it (in the document, I call this “cohesive-ratings-grid” or something).

If you have time to contribute, that’s what I’d most appreciate help with! (Once those things are done, it’ll be more interesting to consider ideas for how to make ratings better, since we’ll be able to evaluate them.)

dexonsmith · May 24, 2024, 11:09pm

Sorry I missed this comment before.

I’m not sure that your specific idea can work well—IMO we should update them together—but it’s certainly possible to preserve the win-means-rating-goes-up property when introducing time periods. The only thing that breaks that property is rolling periods, where you have a look-back window (of games, or of time) that includes different but overlapping subsets of games in each evaluation.
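A toy illustration of how a rolling look-back window can break that property: the sketch below swaps in a simple linear performance rating (mean opponent rating plus 400 times net wins per game) in place of Glicko-2, and the games are made up. It is only meant to show the mechanism, not how any real system behaves.

```python
def performance_rating(window) -> float:
    """Linear performance estimate over a window of (opponent_rating, won) games."""
    n = len(window)
    mean_opponent = sum(opp for opp, _ in window) / n
    net_wins = sum(1 if won else -1 for _, won in window)
    return mean_opponent + 400.0 * net_wins / n


def rolling_ratings(games, size: int):
    """Rating after each game, computed from only the last `size` games."""
    return [performance_rating(games[max(0, i - size + 1): i + 1])
            for i in range(len(games))]


# Hypothetical history: an early win against a strong player, two losses,
# then a win against a weak player.
games = [(2000, True), (1500, False), (1500, False), (1200, True)]
ratings = rolling_ratings(games, size=3)

before_last_win = ratings[2]
after_last_win = ratings[3]
```

The last game is a win, but it pushes the old win against the 2000-rated opponent out of the window, so the recomputed rating goes down. With non-overlapping periods each game is counted exactly once, so this cannot happen.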

Note that my current idea for what to experiment with, documented in RatingsV6 linked above, doesn’t include rolling periods, so the win-means-rating-goes-up property would be preserved.

Also, note that @anoek is against breaking the win-means-rating-goes-up property (given past experience) unless we have really compelling data that it’s giving us some important win we can’t achieve another way. I’m currently optimistic we won’t need to break it.

Samraku · May 25, 2024, 9:49am

I just got done reading the doc; it looks awesome!