A couple of questions about handicap and rating

I think we can all make mistakes here and there; we can be tired, we can hallucinate. But it’s an entirely different matter to do it 1000 times in a day, right?

I just don’t believe some bots can be given a sensible rating. You can’t assign a stable rating to something inherently unstable. If I were to play like a 10 kyu in some games where my opponent resigns early, but then, when we go to scoring, decide to self-atari and throw away a large fraction of my winning games, you’re not going to be able to assign me a sensible rating, especially if there’s no obvious distribution saying if and when I’ll behave like that. It’s just noise.

1 Like

I don’t agree, but maybe I’m missing something.

I’ll explain here why I think that even if a player’s strength is volatile (following some pseudorandom function), they should have a stable rating if they play games frequently enough.

Let’s take an extreme, and invent a bot, 10k20k, that has a 50% chance in each game of playing perfectly like a 10kyu or perfectly like a 20kyu (same strength throughout the game).

If you have a period-based system like Glicko-2, the graph you get is going to depend on how many games 10k20k plays in a period.

I think we get something like this for average playing strength in a period:

  • If 10k20k plays 1 game, there’s a 1/2 chance it played all games like a 10kyu in that period, 1/2 it played like a 20kyu.
    • The average strength for a period is either going to be 10k or 20k.
  • If 10k20k plays 2 games, there’s a 1/2 chance it played half and half, a 1/4 chance it played both games like a 10kyu, and a 1/4 chance it played both like a 20kyu.
    • The average playing strength for a period has a 50% chance of being 15k; a 50% chance it played all 10k or all 20k.
  • If 10k20k plays 10 games, there’s a 252/1024 chance it played exactly half and half, a 672/1024 chance it played 40-60% like a 10k, …, and a 2/1024 chance it played all 10k or all 20k.
    • The average playing strength for a period has a ~25% chance of being 15k, a ~65% chance of being between 14k and 16k, a ~98% chance of being between 12k and 18k, and only a ~0.2% chance of being all 10k or all 20k (the short check after this list reproduces these numbers).
  • If 10k20k plays 800 games, there’s a 1/2^800 chance it played all games like a 10kyu, same for 20kyu. Balance of probability is that it played a lot of games in each mode (it’s quite rare to get mostly heads in 800 coin tosses).
    • The average playing strength for a period is almost always close to 15k.
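For anyone who wants to check these numbers, here’s a quick sketch in Python (the 50/50 coin flip per game is just the assumption of this thought experiment):

```python
from math import comb

# Number of "10k games" in an n-game period for 10k20k is Binomial(n, 0.5).
def prob_k_of_n(k: int, n: int) -> float:
    return comb(n, k) / 2 ** n

n = 10
print(prob_k_of_n(5, n))                            # exactly half and half: 252/1024 ~ 0.246
print(sum(prob_k_of_n(k, n) for k in range(4, 7)))  # 40-60% like a 10k: 672/1024 ~ 0.656
print(sum(prob_k_of_n(k, n) for k in range(2, 9)))  # average between 12k and 18k: ~ 0.979
print(prob_k_of_n(0, n) + prob_k_of_n(n, n))        # all 10k or all 20k: 2/1024 ~ 0.002
```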

If you look at consecutive periods at 1, 2, 10, or 800 games played per period:

  • 1 game / period: average playing strength bounces between 10k and 20k
  • 2 games / period: average playing strength bounces between 15k (half the time), 10k (quarter of the time), 20k (quarter of the time)
  • 10 games / period: average playing strength is between 14k and 16k ~65% of the time, between 12k and 18k ~98% of the time, on the extremes of 10k or 20k only ~0.2% of the time.
  • 800 games / period: average playing strength almost always close to 15k, but very rarely more extreme.

If you have a time-independent system like the current OGS implementation of Glicko-2, then the graph you get is the same no matter how frequently 10k20k plays. If you plot it against time, just zoom in or out to see the exact same curve.

Looking at average playing strength across consecutive periods, you always have:

  • 1 game / period: average playing strength bounces between 10k and 20k

The above just looked at “average playing strength” over a period. Ratings are a level of abstraction built on top of that: current rating is the previous rating adjusted by playing strength in this period. In more detail:

A player’s rating after a period is roughly equal to this sum (I might be over-simplifying; there’s a rough sketch in code after the list):

  • Previous rating
  • Weighted (by opponent deviation) average of, for each game result:
    • Difference between actual game result (0, 1, or 0.5) and expected game result (number between 0 and 1 that depends on opponent’s rating and deviation after adjusting for handicap)
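To make that concrete, here’s a minimal sketch of an update of that shape. To be clear, this is not the actual OGS Glicko-2 implementation; the constant K and the deviation-based weight below are placeholder assumptions, just to illustrate the “previous rating plus weighted surprise” idea:

```python
# Simplified, Elo-flavoured illustration of the description above. Not Glicko-2:
# real Glicko-2 also tracks deviation and volatility; K and the weight formula
# here are made-up placeholders.

def expected_score(rating: float, opp_rating: float) -> float:
    """Expected game result between 0 and 1, Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def update_rating(rating: float, results, k: float = 32.0) -> float:
    """results: list of (opponent_rating, opponent_deviation, score),
    where score is 1 for a win, 0 for a loss, 0.5 for a draw."""
    adjustment = 0.0
    for opp_rating, opp_deviation, score in results:
        # Down-weight games against opponents whose own rating is uncertain.
        weight = 1.0 / (1.0 + (opp_deviation / 350.0) ** 2)
        adjustment += weight * (score - expected_score(rating, opp_rating))
    return rating + k * adjustment

# One period: beat a 1500 (deviation 60), lose to a 1600 (deviation 200).
print(update_rating(1500, [(1500, 60, 1), (1600, 200, 0)]))  # ends a bit above 1500
```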

If you have more games per period, the average playing strength is more stable (most periods, it’ll be near the true average), and so the ratings that are built on average playing strength are also more stable.


Regardless, I don’t care all that much if bots have crappy ratings. I do care that playing games more often doesn’t result in a clearer picture of a player’s rating, and instead causes the rating to jump around more quickly. The bots are just an extreme example of the current implementation’s flaws in this regard.

3 Likes

My impression (without having actually studied the numbers) is that a big difference with bots is this:

  • A bot will happily play 20 ranked games in a row with the same player, losing every time due to some flaw in the bot’s algorithm.
  • A person can have plenty of flaws but after losing a couple of times will either learn to fix them or stop accepting the matches.

You know, there are people like this:

The same thing applies when bots win 20 times against the same player - either way it will obviously lead to a ton of variability if those games are treated the same as 20 games against 20 different players.

If a new ranking system is on the table, I suggest strongly down-weighting repeated games between the same pair of players somehow, similarly to how we handle repeated correspondence timeouts.
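As a sketch of what that could look like (the halving-per-rematch schedule and the 20-game window here are purely hypothetical choices, not how OGS actually handles repeated correspondence timeouts):

```python
from collections import deque

# Hypothetical down-weighting of repeated pairings: each recent game against
# the same opponent halves the weight of the next one.
def game_weight(history: deque, opponent_id: str, window: int = 20) -> float:
    recent_rematches = sum(1 for opp in list(history)[-window:] if opp == opponent_id)
    return 0.5 ** recent_rematches  # 1.0, then 0.5, 0.25, ... for rematches

history = deque()
for opp in ["bot_a"] * 5:
    print(f"game vs {opp}: weight {game_weight(history, opp):.3f}")
    history.append(opp)
```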

3 Likes

Airbagging from 25k to 18k, what a brilliant start for a go player! :face_with_spiral_eyes:
I sincerely hope he finally enjoyed some much better experience with a human player since then. :handshake:

2 Likes

I think maybe one thing you’re not taking into account properly is the opponents’ playing strengths, which are also random.

The 10k20k bot will randomly decide to play like a 10k against 11ks, 12ks, etc., which will start to bring the rank up, and then like a 20k against 25k or weaker players, and this causes massive fluctuations in the rank. Even if it happened to be hovering around 15kyu temporarily, there are huge upsets happening all the time, which leads to large volatility and large changes in the ratings.

So there’s no stabilising effect on the ratings with more games.

And I’m arguing that it’s somewhat flawed to judge the system by looking at certain bots; I doubt that many rating systems would be able to assign a sensible rank to a bot like 10k20k or amybot ddk.

So I don’t think it’s actually indicative of a flaw in the current system.

I do think that OGS ratings can be overly volatile for established players, but I don’t know if that is due to OGS somehow misusing Glicko-2 or perhaps some misconfiguration.

I’m sorry, I am generally fairly interested in rating systems and I kind of like talking about it.
But I can’t really comment when it involves Glicko-2 (or something else significantly more complicated than Elo or Bradley-Terry). I just don’t know Glicko-2 well enough, and I’m not interested enough in rating systems to delve into its details.

1 Like

It could also be by design. A quickly adjusting rank helps players get near the right rank faster.

1 Like

This. But I guess that would make 10k20k appear as ‘?’.

Have you thought about preventing “I won, but my rating went down”, when reintroducing sliding windows?

1 Like

What does lichess do? That said, at this point I’d just bite the bullet.

I think it’s an interesting thought experiment, but I’m not sure why you’re sharing that with me? I did not say that the goal of the rating system is to predict outcomes - it’s to enable better matchmaking.

Let’s take this in the other direction: imagine we remove volatility altogether, since some people think it’s a bug… everyone stays at their initial rank forever. That system is hardly useful either!

1 Like

I think watching micro-adjustments in rank is the most ridiculous hobby ever :joy: it’s really absurd that the rank system needs to accommodate this

5 Likes

I suppose that good matchmaking means looking for opponents against whom the predicted outcome is about 50% winrate. How else would you implement good matchmaking?
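For illustration, a minimal sketch of that idea, assuming an Elo-style expected-score formula and a made-up candidate pool (this isn’t OGS’s actual matchmaker):

```python
# Pick, from a pool of candidate opponents, the one whose predicted winrate
# against the seeking player is closest to 50%.
def expected_score(rating: float, opp_rating: float) -> float:
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))

def best_match(player_rating: float, candidate_ratings: list) -> float:
    return min(candidate_ratings,
               key=lambda r: abs(expected_score(player_rating, r) - 0.5))

print(best_match(1500, [1350, 1480, 1620, 1900]))  # -> 1480, the most even game
```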

1 Like

Yes, that is what I mean when I say good matchmaking.

Edit: oh I think there may be confusion around the terminology. When I said “predict outcomes” I meant predicting the result as Uberdude’s post was describing. A 50% winrate is not a result - “Black wins” is a result.

2 Likes

You would think so, but the provisional rating cutoff is still quite high: 160 is a lot, given that we know amybot ddk wildly fluctuates in strength and its deviation is only around 100.

Not sure what kinds of results it would take to bring it back up to 160.

I think something on the order of 60-70 seems pretty stable.

I’m not sure how low the deviation can really go either.

Deviation doesn’t depend on game results, right?
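For what it’s worth, here’s a sketch of the deviation part of the Glicko-2 update with the volatility step left out. In this part of the published formulas the game results don’t appear at all, only the opponents’ ratings and deviations (results do feed into the separate volatility step, which is omitted here):

```python
import math

# Glicko-2 deviation update, volatility step skipped. Quantities are on the
# Glicko-2 internal scale (mu, phi), not the displayed rating scale.
def g(phi: float) -> float:
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected(mu: float, mu_j: float, phi_j: float) -> float:
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def new_phi(phi: float, opponents: list, mu: float = 0.0) -> float:
    """opponents: list of (mu_j, phi_j). Note: no game results in here."""
    v_inv = sum(g(pj) ** 2 * expected(mu, mj, pj) * (1.0 - expected(mu, mj, pj))
                for mj, pj in opponents)
    return 1.0 / math.sqrt(1.0 / phi ** 2 + v_inv)

# Playing the same two opponents shrinks the deviation by the same amount
# whether the games were won or lost.
print(new_phi(phi=1.2, opponents=[(0.0, 0.6), (0.3, 0.8)]))
```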

Wow, so much discussion since I was last here a couple of days ago! Thanks shinuito for the link to 2020 Rating and rank tweaks and analysis. I haven’t read all 249 posts in that thread. But just looking at the tables near the top – blitz ranks predicting blitz results, overall ranks predicting blitz results, etc – I can’t help wondering if there’s a conceptual error in the analysis method.

Consider:

  • Player A is better at blitz than live, and has a higher blitz ranking. Therefore they win only 30% of their live games due to being overranked, while the prediction based on their overall rank is 50%.
  • Player B is better at live than blitz, so they win 70% of live compared to predicted 50%.

If there’s an equal number of games played by type A and type B players, then when you aggregate over all games, the errors will cancel out, so on average it looks like the system is working perfectly. Even if there are more type A than type B players (or vice versa), if those groups account for only about 5% of all games, then you’ll have a bunch of unhappy people who are invisible in the aggregate statistics.

So what’s needed to resolve this is:

  • Classify players into type A (blitz rank significantly higher than live rank, maybe >2 ranks difference?), type B (other way round) and “normal”.
  • Count what proportion of players are in each group. Decide if groups A and B are big enough to care about.
  • If we care, then check the prediction accuracy separately per group (roughly as in the sketch below).
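A rough sketch of that per-group check, on made-up numbers that mirror the type A / type B example above (not real OGS data):

```python
# Each record is (player_type, predicted_winrate, actual_result 0/1).
games = (
    [("A", 0.5, 1)] * 30 + [("A", 0.5, 0)] * 70    # type A: predicted 50%, wins 30%
    + [("B", 0.5, 1)] * 70 + [("B", 0.5, 0)] * 30  # type B: predicted 50%, wins 70%
)

def calibration(records):
    predicted = sum(p for _, p, _ in records) / len(records)
    actual = sum(r for _, _, r in records) / len(records)
    return predicted, actual

print("overall:", calibration(games))              # (0.5, 0.5) - looks fine in aggregate
for group in ("A", "B"):
    subset = [g for g in games if g[0] == group]
    print(f"group {group}:", calibration(subset))  # predicted 0.5 vs actual 0.3 / 0.7
```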

I respect any admin decisions regarding the effort and risk of changing things here versus the benefit, and fully understand if it’s just not a big enough problem to be worth addressing. But as a starting point, it would be nice to know if we’re just talking about five users in total or if there’s hundreds of people thinking they need to create multiple accounts to keep their blitz and live ranks separate.

5 Likes

If we’re going to evaluate ranking/matchmaking systems, it’s important to have some idea of what the goal is, and just saying that you are targeting a win rate of 50% (or close) is not enough, is it? Assigning ranks completely randomly gives you a win rate of 50%.

If the ranking system predicts a 50% winrate for some game, does its outcome actually tell you anything about the quality of that ranking system?

2 Likes

You’d need to track individual players over a longer sequence of games.
If the ratings system ranks all players either 30k or 9d depending on their last result, and always predicts a 50% chance of winning, you’ll find some players (who are actually 30k) who score much less than 50% and some other players (who are actually 9d) who score much more than 50%.
This rating system may seem to predict correctly overall (after all, every game is won by 50% of the 2 players), but for many individual players its predictions will be way off.

But don’t get your expectations up too high. Even predictions of a very good rating system will be wrong some 40% of the time when most games are quite evenly matched between players less than 100 Elo rating points apart. It would mean that most game results are close to coin flips, and it’s hard to predict those correctly more than 50% of the time. Because of the inherent randomness of such an environment, you need to take large samples to assess the quality of a rating system.
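A quick Monte Carlo sketch of that point, assuming an Elo-style win probability and a made-up, uniform distribution of rating gaps below 100 points: even a predictor that knows the true win probabilities only calls the winner right about 57% of the time under these assumptions.

```python
import random

# How often does a "perfect" predictor (one that knows the true win probability)
# pick the actual winner when rating gaps are under 100 Elo points?
def win_prob(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))  # chance the stronger player wins

random.seed(0)
games = 100_000
correct = 0
for _ in range(games):
    gap = random.uniform(0, 100)        # assumed distribution of matchup gaps
    favorite_won = random.random() < win_prob(gap)
    correct += favorite_won             # predictor always picks the favorite
print(f"accuracy: {correct / games:.1%}")  # roughly 57%
```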

3 Likes

As you say, it’s a thought experiment that takes “the measure of quality of a rating system is its ability to predict game outcomes” to its logical extreme, to show that it’s a poor metric to use on its own. Not claiming you were a proponent of that, and as you say matchmaking is another very important goal, but the whole analysis we were talking about of overall vs specific ranks was measured against game outcome prediction. Perhaps a better, though much harder, metric would be “player happiness that they were paired with people of similar strength” (e.g. not like this guy Impossible to detect sandbaggers/lower ranked people). The whole idea of a person having a rank is based on some averaging, and we need to find the sweet spot.

3 Likes