I don’t agree, but maybe I’m missing something.
I’ll explain here why I think that even if a player’s strength is volatile (following some pseudorandom function), they should have a stable rating if they play games frequently enough.
Let’s take an extreme case, and invent a bot 10k20k that has a 50% chance in each game of playing perfectly like a 10kyu or perfectly like a 20kyu (same strength throughout the game).
If you have a period-based system like Glicko-2, the graph you get is going to depend on how many games 10k20k plays in a period.
I think we get something like this for average playing strength in a period:
- If 10k20k plays 1 game, there’s a 1/2 chance it played all games like a 10kyu in that period, 1/2 it played like a 20kyu.
- The average strength for a period is either going to be 10k or 20k.
- If 10k20k plays 2 games, there’s a 1/2 chance it played half and half, a 1/4 chance it played all games like a 10kyu, and a 1/4 chance it played all like a 20kyu.
- The average playing strength for a period has a 50% chance of being 15k; otherwise it played all 10k or all 20k (25% each).
- If 10k20k plays 10 games, there’s a 252/1024 chance it played half and half, 672/1024 chance it played 40-60% like 10k, …, 2/1024 it played all 10k or all 20k.
- The average playing strength for a period has a ~25% chance of being 15k; a ~65% chance of being between 14k and 16k; a ~98% chance of being between 12k and 18k; and only a ~0.2% chance of being all 10k or all 20k.
- If 10k20k plays 800 games, there’s a 1/2^800 chance it played all games like a 10kyu, same for 20kyu. Balance of probability is that it played a lot of games in each mode (it’s quite rare to get mostly heads in 800 coin tosses).
- The average playing strength for a period is almost always close to 15k.
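The binomial arithmetic above can be checked with a short script (the function name and the “average rank” convention are mine, just for illustration):

```python
from math import comb

def strength_distribution(n):
    """Distribution of 10k20k's average playing strength over n games.

    If j of the n games are played at 10k strength, the average rank is
    20 - 10*j/n (10k when j == n, 20k when j == 0). Each game is an
    independent fair coin flip, so j is binomial(n, 1/2).
    """
    return {20 - 10 * j / n: comb(n, j) / 2**n for j in range(n + 1)}

dist = strength_distribution(10)
print(round(dist[15.0], 3))                                       # exactly 15k
print(round(sum(p for r, p in dist.items() if 14 <= r <= 16), 3)) # 14k-16k
print(round(sum(p for r, p in dist.items() if 12 <= r <= 18), 3)) # 12k-18k
```

This reproduces the 10-game numbers: 252/1024 ≈ 0.246 for exactly 15k, 672/1024 ≈ 0.656 for 14k–16k, and 1002/1024 ≈ 0.979 for 12k–18k.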
If you look at consecutive periods at 1, 2, 10, or 800 games played per period:
- 1 game / period: average playing strength bounces between 10k and 20k
- 2 games / period: average playing strength bounces among 15k (half the time), 10k (a quarter of the time), and 20k (a quarter of the time)
- 10 games / period: average playing strength is between 14k and 16k ~65% of the time, between 12k and 18k ~98% of the time, on the extremes of 10k or 20k only ~0.2% of the time.
- 800 games / period: average playing strength almost always close to 15k, but very rarely more extreme.
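A quick simulation shows the same shrinking spread as the games-per-period count grows (the helper name and the 1000-period sample size are arbitrary choices of mine):

```python
import random
import statistics

random.seed(0)

def period_averages(games_per_period, periods=1000):
    """Average rank (between 10k and 20k) of 10k20k in each simulated period."""
    avgs = []
    for _ in range(periods):
        # each game is independently 10k or 20k with probability 1/2
        ranks = [random.choice([10, 20]) for _ in range(games_per_period)]
        avgs.append(sum(ranks) / games_per_period)
    return avgs

for n in (1, 2, 10, 800):
    spread = statistics.stdev(period_averages(n))
    print(f"{n:4d} games/period: stdev of period average ~ {spread:.2f} ranks")
```

The standard deviation of the period average falls like 1/sqrt(n): roughly 5 ranks at 1 game/period, down to under 0.2 ranks at 800 games/period.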
If you have a time-independent system like the current OGS implementation of Glicko-2, then the graph you get is the same no matter how frequently 10k20k plays. If you plot it against time, just zoom in or out to see the exact same curve.
Looking at average playing strength across consecutive periods, you effectively always have the 1 game / period case:
- average playing strength bounces between 10k and 20k, no matter how frequently 10k20k plays
The above just looked at “average playing strength” over a period. Ratings are a level of abstraction built on top of that: current rating is the previous rating adjusted by playing strength in this period. In more detail:
A player’s rating after a period is roughly equal to this sum (I might be over-simplifying):
- Previous rating
- Weighted (by opponent deviation) average of, for each game result:
  - Difference between the actual game result (0, 1, or 0.5) and the expected game result (a number between 0 and 1 that depends on the opponent’s rating and deviation, after adjusting for handicap)
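In code, that over-simplified update looks roughly like this (this is a sketch, not the real Glicko-2 math: I’m using an Elo-style expected-score curve and a made-up K factor as stand-ins for the deviation-based weighting, and rating points rather than kyu ranks):

```python
def expected_score(rating, opp_rating):
    """Elo-style expected score in [0, 1]; real Glicko-2 also
    flattens this curve based on the opponent's deviation."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def simplified_update(rating, results, k=32):
    """results: list of (opponent_rating, actual_score), score in {0, 0.5, 1}.

    New rating = old rating + K * sum(actual - expected): a rough stand-in
    for the weighted average of result differences described above.
    """
    return rating + k * sum(s - expected_score(rating, opp) for opp, s in results)

# a period where the player beat an equal-rated opponent and lost to a weaker one
print(simplified_update(1500, [(1500, 1), (1400, 0)]))
```

The point of the sketch is just the shape of the update: each game contributes (actual − expected), so the rating moves with the playing strength shown during the period.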
If you have more games per period, the average playing strength is more stable (most periods, it’ll be near the true average), and so the ratings that are built on average playing strength are also more stable.
Regardless, I don’t care all that much if bots have crappy ratings. I do care that playing games more often doesn’t result in a clearer picture of a player’s rating, and instead causes the rating to jump around more quickly. The bots are just an extreme example of the current implementation’s flaws in this regard.