Like this GitHub - breakwa11/GoAIRatings: Estimate Go AI ratings by real games
Excuse my obsession with measurements.
It’s just capped, I think. There’s a minimum and a maximum rating.
Interesting. If I were a programming wizard, I’d like to see what happens without the caps.
I’d assume it would just keep going down.
I imagine totally random play loses to any real player 100% of the time, unless they just happen to resign out of frustration, boredom, or some other reason like trying to lower their rank.
Not with high enough handicap! Unfortunately there’s no easy way to make it insist on that.
If some stable fraction of players would lose against it like that, the random player’s rating would eventually settle around some rating consistent with that winrate.
But there can be players much worse than a random player, like a player that resigns as soon as the game starts to count as rated (you can resign when it’s not your turn).
So a player following that strategy 24/7 with the shortest lag would have the fastest-dropping rating.
Even if it were going to settle, as in fluctuate around some value because it keeps losing games to players at certain levels, I imagine it wouldn’t settle in the usual sense.
The volatility would probably shoot up in Glicko-2, and then the adjustments might keep being large, the way we see it happen for bot accounts.
The thing is, though, if you have a player that should lose 100% of the time to another player, they can’t even really be put on the same rating scale.
Glicko and Elo, at least if unmodified, should give any two players a small chance of an upset.
I think the uniform random bot, say, should lose 100% of the time in theory to any human player, and so shouldn’t be on the same rating scale. Hence its rating should diverge to -inf as it keeps playing games. Though it might take longer and longer if there are no players nearby in rating at each step, and it would be slowed further by any games against non-real players, auto-resign players like you describe, or sandbaggers.
The Elo scale can handle quite small odds. A 400 Elo difference is defined as 1:10 odds. A 4000 Elo difference is that to the power of 10, so odds of 1 in 10 billion. So I don’t think you really need a different scale.
Although it should be 480.
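To make that arithmetic concrete, here’s a minimal sketch using the standard Elo expected-score formula (with the usual 400-point constant; swap in 480 for the scale mentioned above):

```python
# Elo expected score: the win probability implied by a rating difference.
# With the standard constant of 400, a 400-point gap means 1:10 odds.

def expected_score(rating_diff, scale=400):
    """Win probability for the higher-rated player, given the rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))

for diff in (400, 800, 4000):
    p = expected_score(diff)
    print(f"{diff:>5} Elo gap: win prob {p:.10f}, roughly 1 upset in {1 / (1 - p):,.0f} games")

# 400 Elo  -> about 1 in 11   (i.e. 1:10 odds)
# 800 Elo  -> about 1 in 101
# 4000 Elo -> about 1 in 10 billion
```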
I get it, but in a theoretical sense, if you’re comparing a player that resigns on their first move, or a player that plays uniformly at random, against a proper player who knows the basics of the rules and scoring, those belong on a different scale.
A player that knows how to capture stones, for example, will most likely be able to beat a player playing uniformly at random: just capture the stones.
You can put them on the same scale, but what I’m suggesting is that they belong on separate scales.
Even the person that resigns on their first move belongs on a separate scale to a player that actually plays the game.
Did you see this 5-year-old video where someone pitted various chess engines against each other in a tournament and determined their Elo ratings?
He included Stockfish and various nerfed versions of it, and also a random player (discussed from 5:30). He also included various other handcrafted engines playing by silly strategies.
The random strategy ended up with an Elo rating of 477.6.
The worst strategy was worstfish, ending up at an Elo rating of 207.8.
If he had included the “resign ASAP” strategy, I’m sure it would have ended up even lower than worstfish. I admit that this strategy is probably too silly even for the purpose of that video, but the video does show that you can have all kinds of players that are capable of playing full games of chess, while still playing consistently worse than random play. And it also shows that the Elo rating system is perfectly capable of accounting for such players.
It’s a cool-looking article.
I was flicking through some more articles mentioning Jeff Sonas and came across this one from 2023 with a recommendation:
The 400-points rule: A difference in rating of more than 400 points shall be counted for rating purposes as though it were a difference of 400 points, with no restrictions on how many times it can be applied during a single tournament, thus restoring it to the pre-2022 state. Notably, almost 90% of received emails favored reverting to the previous 400-point-rule.
Yeah, but we know that with an Elo system it’s likely to take an exponential amount of time for your rating difference to accurately reflect your winrate.
For instance, if I only win one game in 10^N, it’s going to take a lot of games for me to lose enough rating, unless the rating changes are capped and modified in a certain way. Like instead of losing 1e-10 points in a completely one-sided game, I always lose a minimum of, say, 5 points.
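A rough sketch of that effect, using a plain Elo update with K = 20 and the 5-point floor purely as illustrative numbers:

```python
# Plain Elo update for the loser of a game, with and without a minimum loss.
# K = 20 and the 5-point floor are just the example numbers from the discussion.

K = 20

def expected_score(my_rating, opp_rating):
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - my_rating) / 400.0))

def points_lost(my_rating, opp_rating, min_loss=None):
    """Rating change (negative) for `my_rating` after losing to `opp_rating`."""
    delta = K * (0.0 - expected_score(my_rating, opp_rating))  # actual score 0 = loss
    if min_loss is not None:
        delta = min(delta, -min_loss)
    return delta

# A 4000-point underdog losing the game it was expected to lose barely moves...
print(points_lost(1000, 5000))                # about -2e-9 points
# ...unless a minimum loss is enforced.
print(points_lost(1000, 5000, min_loss=5))    # -5 points
```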
So I think you need to distinguish between settling in the sense that the rating system is adjusting too slowly given the number of games played, and settling in the sense that the number accurately predicts the winrates.
If we imagine the rating update step as something like a gradient descent method moving players toward their true ratings, then if the learning rate is too low you’ll never reach an actual minimum in a limited number of update steps. So if the K factor, which acts like a learning rate, is small compared to the expected difference in ratings (say you expect thousands or tens of thousands of Elo difference between Stockfish and a random bot, while the K factor is around 20), then you probably won’t realise that difference in a short simulation.
In fact I believe this is the point of the discussion around 39 mins in your linked video.
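For what it’s worth, here’s a toy simulation of that point (my own sketch, not the method from the video): two players start at the same rating, the stronger one wins every single game, and the ratings are updated with K = 20.

```python
# Toy Elo run where one player wins every game, K = 20.
# The measured gap grows quickly at first, then slows down dramatically
# as the expected score saturates near 1: roughly 10x more games are
# needed for every extra ~400 points of gap.

K = 20

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

strong = weak = 1000.0
for game in range(1, 1_000_001):
    e = expected(strong, weak)
    strong += K * (1.0 - e)   # strong player always wins
    weak   += K * (0.0 - e)
    if game in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"after {game:>9,} games: measured gap = {strong - weak:7.1f}")
```

In a run like that, even a million completely one-sided games only opens up a gap on the order of two thousand points, which is the learning-rate effect in practice.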
Well, by “capable” you’re saying it can assign a number to it?
Any system can assign a number to players, but I would reserve “capable” for meaning it can accurately predict the win rates of games between players at those given ranks.
I don’t see that analysis being done in that video. All I see is a graph showing calculated scores, which you can do by starting off everyone at the same rank and updating by results.
There are other ways you can do it as well, treating it like an inverse problem: given the results, what Elo difference would likely lead to them? But when one player wins 100% of the games against another, you can’t put a sensible Elo difference there.
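To sketch what I mean by the inverse problem (my own toy example, not what the video does): the best-fit Elo gap for W wins out of N games is just the gap whose predicted win probability equals W/N, i.e. the expected-score formula inverted, and it blows up as the winrate approaches 100%.

```python
import math

# Best-fit (maximum-likelihood) Elo gap from head-to-head results:
# choose the gap whose predicted win probability equals the observed
# winrate W/N, which means inverting the Elo expected-score formula.

def fitted_gap(wins, games):
    p = wins / games
    if p == 0.0 or p == 1.0:
        return -math.inf if p == 0.0 else math.inf  # 100% results give no finite answer
    return 400.0 * math.log10(p / (1.0 - p))

print(fitted_gap(60, 100))    # ~  +70 Elo
print(fitted_gap(99, 100))    # ~ +798 Elo
print(fitted_gap(100, 100))   # inf: no sensible finite difference
```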
I’ll watch the video properly, but for instance if it turns out that the reason the diluted Stockfish draws against a terrible random bot is something like the 50-move rule, I feel like that’s a kind of wash in some sense as a way to bridge the gap between “skilled” and “unskilled” random play.
In Go, if we let games actually play out to their conclusion and didn’t accept a draw as a legitimate outcome (superko rulesets, for instance), then I think we would run into issues more quickly.
Yes, you can’t really reach meaningful Elo ratings when the player population basically consists only of 2 players having a great skill gap between them.
As I understand it, he fixed that issue by creating players from mixing various amounts of good and bad strategies to bridge the gaps between “pure” players that would have too wide a skill gap between them to determine their relative Elo ratings from a limited number of games.
So by adding these mixed players, he created a player population that is more like a continuum of player levels and strategies, and he could get more or less consistent Elo ratings.
I suppose this method would also work for go ratings of progressively worse players. Like mixing random play with varying amounts of, for example, “self-atari when you can, play random otherwise”.
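Something like this would be a minimal way to build such a family of players (just a sketch; `random_move` and `self_atari_move` are hypothetical stand-ins for whatever move generators a real Go engine would provide):

```python
import random

# A family of progressively worse players: with probability `badness`, try the
# deliberately bad strategy ("self-atari when you can"), otherwise play randomly.
# `random_move` and `self_atari_move` are hypothetical move generators; the
# latter is assumed to return None when no self-atari move exists.

def make_mixed_player(badness, random_move, self_atari_move):
    def choose_move(position):
        if random.random() < badness:
            bad = self_atari_move(position)
            if bad is not None:
                return bad
        return random_move(position)
    return choose_move

# Players at badness 0.0, 0.25, 0.5, 0.75, 1.0 would form a rough continuum of
# skill levels, which is what lets the rating system connect "pure random" play
# all the way down to "always self-atari".
```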
As I said, I’ll watch the video, but I’m guessing this works very artificially, because chess has simple draws by repetition, the 50-move rule, and running out of pieces.
So you can accidentally draw against an opponent that plays well 10% of the time.
I think if you tried the same thing in Go with superko, or in Amazons or Hex, games where you have a definite winner, I’m not sure you can bridge that gap so simply. One player might “accidentally win” 100% of the time, because their strategy is hugely more likely to serve the goal of the game (like the cluster bot in that video), without the game being randomly called a draw because of repetition.
I’d think that when the player population is sufficiently diverse and continuous by skill level as well as by strategic and tactical repertoires, the only way for one player to end up winning 100% of their games in the long run (even against an ever so slightly diluted version of itself) is when that player is by far the best player of that population, perhaps even a perfect player (which may be entitled to an infinite Elo rating in a game that doesn’t have draws).
This is exactly my point but just in the reverse.
We accept that a perfect player in a population of other players should have an infinite Elo rating in a game that doesn’t have draws. It doesn’t fit on the rating scale of normal players.
Similarly, going in the opposite direction: when you have a player that’s so bad it’s unlikely ever to win a game against a certain population, it would fit better at -inf Elo, but there are possibly a variety of ways to lose a game.
It could be, and probably should be by symmetry, that there can be a number of perfect-play “bots”, if they could exist, that differ in what kinds of winning moves they prefer.
Those “bots” should live on a separate rating scale of their own, apart from humans, and they would likely have a 50-50 winrate against each other, with the outcome simply decided by who gets which color.
I mostly agree with that, except I think “unlikely” is not strong enough. To deserve an actual negative-infinity rating, it should be impossible for it to win.
I suppose the “resign ASAP” strategy would be a good candidate, but to ensure it’s impossible for it to win even against other players following that strategy, it would also need to have a response time in the order of a Planck time (5.39x10^-44 s), so it always resigns faster than any other player (other than itself).
Or you can limit resigning to being on your own turn.
Again though, you might want to limit the scale to players that actually play the game, since otherwise you can construct an entirely separate set of scales, one for each move number of a sensible game length, where one player always loses to the next and to every human.
The resign-on-move-1 bot always loses to the resign-on-move-2 bot (if we assume “move 2” means the bot’s second move, etc.), resign-on-move-2 always loses to resign-on-move-3, and so on.
So you might want to restrict to bots that play the game and don’t resign. Of course that kind of simulation would be annoying for Go, but maybe small-board Go would be fine, like 4x4 or 5x5 or something.
You could mix these players to varying degrees to create a continuum of players, and thus a meaningful rating scale connecting them together, especially when you also include mixed players following different strategies. You could also use this method to fill the rating gap between these and normal players.
I don’t understand why you’re so adamant about isolating such players from the rest of the player population. I fail to see the problem that is fixed by banning such special players from playing games within the same rating system as normal players.
I’m not aware of a rule set where this is the case, so I guess this would be a go variant?