A couple of questions about handicap and rating

dexonsmith · May 1, 2024, 8:41pm

FTR, my experience is that anoek is open to all changes to ratings. But, OGS needs to be very conservative about:

collecting data (via goratings) ahead of time, to be sure it will be an improvement
frequency of changing the live system, because each change is disruptive and expensive (since it involves recalculating the full history of ratings)

Agreed. I floated this with anoek a few months ago and he agrees it’s worth looking into. I haven’t yet had time to implement this in the goratings repo. If someone else is motivated and has the time, please loop me in to the investigations because I’ve already done some thinking. (I have the same username on GitHub.)

(By the way, the term Glicko-2 uses for uncertainty is “deviation”. This is the term of art used in the goratings repo.)

This is correct; there is currently no “time” element to increase uncertainty on OGS.

Glicko-2 is designed to be time-based, whereby games are evaluated together in a “period” (of some length of time), and ratings+deviation are recalculated at the end of each period. A period with no games would increase the deviation without changing the rating.
The implementation here evaluates each game in isolation, as if it’s the only game in a period (of variable length).
The goratings repo does have (behind a flag IIRC?) some degradation of deviation over time. I landed this a few months ago. It’s a start, but IMO not enough.
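
To make the "period with no games" behaviour concrete, here is a minimal sketch of the standard Glicko-2 rule for an idle period: the deviation grows as φ' = sqrt(φ² + σ²) on the Glicko-2 scale while the rating stays put. The function name, the 350 cap, and the example numbers are my own assumptions for illustration; this is not the code in the goratings repo.

```python
import math

GLICKO2_SCALE = 173.7178  # Glicko <-> Glicko-2 scale conversion factor


def age_deviation(rd: float, volatility: float, idle_periods: int,
                  max_rd: float = 350.0) -> float:
    """Return the deviation after `idle_periods` rating periods with no games.

    Hypothetical helper. Applying phi' = sqrt(phi^2 + sigma^2) once per idle
    period with a constant volatility is the same as
    sqrt(phi^2 + n * sigma^2), which is what is computed here.
    The rating itself is left unchanged by an idle period.
    """
    phi = rd / GLICKO2_SCALE
    phi = math.sqrt(phi * phi + idle_periods * volatility * volatility)
    # Assumption: never exceed the conventional deviation of a brand-new player.
    return min(phi * GLICKO2_SCALE, max_rd)


if __name__ == "__main__":
    # A player with deviation 100 who sits out 10 periods.
    print(round(age_deviation(100.0, 0.06, 10), 1))
```

How long a "period" should be, and whether the growth should be capped, are exactly the kind of parameters that would need data from goratings before changing anything.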

A third issue, not discussed in this thread yet I think, is that accounts that play lots of games can see massive fluctuations in ratings over a short period.

For example, amybot-ddk, which is a bot account, has fluctuated between 7.6k and 24.1k in the currently visible rating history (last 5000 games, which is about a week for this bot).

A player whose strength isn’t changing and who plays 5000 games per week should have a fairly stable rating with a very low deviation. But the deviation never gets as small as one might expect.

There are probably a number of factors contributing to this volatility in ratings and surprisingly high deviation.

“Time” isn’t used in the OGS ratings calculations. This means that 10-20 game winning or losing streaks cause wild swings in ratings, even if over the course of a day the bot plays consistently.
Maybe amybot-ddk has a different strength for different board sizes.
If a player times out when playing a bot, the game is not rated. So “resign by abandonment” depresses bot ratings. And since bots don’t care either way, humans don’t consider it rude to abandon games to them.

It’s not just bots. I think ratings are a bit weirdly volatile on OGS. Anecdotally, it seems like the more games you play, the more your rating fluctuates. But in a properly functioning system, it should be the other way around (unless your playing strength is actually changing).

Over the next few weeks, I’m hoping to find time to write up a proposal / summary document of what I think should happen, why, and what data need to be collected to support the changes. But here are the high-level pieces of what should change (with some discussion of why, but no data!):

Add a time element.
- E.g., a sliding window of one week.
- But probably something more complex (e.g., sliding window of 1 day, but look back up to a month to try to get a minimum of 10 games; if looking back, “age” the starting rating to only 1 day old before running the update).
Make each rating category (mostly) independent. E.g., something like this (but need data to fine-tune):
- If this rating category (e.g., live-9x9) has a game in the last month(?), just use it.
- Else, compute a blended rating from the parent rating categories (e.g., live and 9x9) by computing a weighted average and combining the parent deviations; if the blended rating ends up with a lower deviation, use that.
Re-evaluate whether (and if “yes”, when) correspondence timeouts should be annulled.
- The primary harm from NOT annulling correspondence timeouts is that the returning player needs to defeat “a lot of opponents” to get back to their true playing strength.
- I think it’s possible that we don’t need this protection anymore. Glicko-2 allows ratings to change quickly (won’t be too many trounced opponents). Also, only the correspondence rating will have tanked, and if they’re gone long enough, they’ll fall back to the blended rating anyway.
Stop annulling games when players abandon bot games. Give bots the win.
- If players want to abandon bot games without it affecting their ratings, they should use “unranked”.
- Else, we should assume the player abandoned because they felt they were losing.
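
As a rough illustration of the "blended rating" idea for independent rating categories, here is one way it could be sketched. Everything in it is an assumption on my part: the `Rating` type, the function names, the 30-day cutoff, and especially the choice of inverse-variance weighting, which treats the parent categories as independent estimates (they are not, since they share games). The right weighting needs data.

```python
import math
from dataclasses import dataclass


@dataclass
class Rating:
    rating: float
    deviation: float


def blend(parents: list[Rating]) -> Rating:
    """Inverse-variance weighted average of the parent ratings (assumed scheme)."""
    weights = [1.0 / (p.deviation ** 2) for p in parents]
    total = sum(weights)
    mean = sum(w * p.rating for w, p in zip(weights, parents)) / total
    # Combined deviation of the weighted mean under the independence assumption.
    return Rating(mean, math.sqrt(1.0 / total))


def effective_rating(own: Rating, days_since_last_game: float,
                     parents: list[Rating],
                     recent_days: float = 30.0) -> Rating:
    """Pick the rating to use for a category such as live-9x9.

    - Played in this category recently: just use its own rating.
    - Otherwise: blend the parents (e.g. live and 9x9) and use the blend
      only if its deviation is lower than the category's own deviation.
    """
    if days_since_last_game <= recent_days or not parents:
        return own
    blended = blend(parents)
    return blended if blended.deviation < own.deviation else own


if __name__ == "__main__":
    live, nine = Rating(1500.0, 100.0), Rating(1700.0, 100.0)
    stale = Rating(1400.0, 200.0)  # live-9x9, not played for 60 days
    print(effective_rating(stale, 60.0, [live, nine]))
```

In a fuller version, `own.deviation` would presumably be aged for the time since the last game before the comparison, so a long-idle category loses out to the blend more often.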