I remain skeptical of the idea of the “skill” of a bot in the same sense as the “skill” of a person, especially for the weak bots. It’s not clear to me why semi-random moves map to any particular skill level.
Bots don’t have fluctuating skill levels, so the fluctuations must be an artifact of the rating system.
I’m convinced that it’s the result of repeated games against the same player. Look in any bot’s history and you’ll find players who play 10 or more games against the same bot, over and over, then switch to another bot and repeat.
Each of these repetitive results gets factored in as if it were a statistically independent result, driving ratings up or down far more than they really should. Give us tools to block repeated games, or weight them differently in the rating system, and we’ll have stable bots!
(Also, for using a bot as an anchor, you might only care about long-term stability; the day-to-day fluctuations don’t necessarily matter.)
I’m not sure whether this is true. I know little about these things, but my loose understanding is that the rating system should withstand this. Each time I play you and win, my rank goes up less, because my resulting rank is already higher than yours. Shouldn’t the same hold for repeated wins against a bot?
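As a toy illustration of that diminishing-returns effect, here it is in plain Elo (not OGS’s actual Glicko-2; the K-factor and starting ratings are invented for the example):

```python
# Toy illustration (plain Elo, NOT OGS's actual Glicko-2): repeated wins
# against the same opponent yield smaller and smaller gains, because each
# win widens the gap and makes the next win less surprising.

K = 32  # arbitrary K-factor, chosen for illustration

def expected_score(r_a, r_b):
    """Chance that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

me, bot = 1500.0, 1500.0
for game in range(1, 11):
    gain = K * (1 - expected_score(me, bot))  # I win every game
    me += gain
    bot -= gain
    print(f"game {game:2d}: me {me:.1f}, bot {bot:.1f}, gain {gain:.1f}")
```

The gain shrinks every game as the gap widens. Glicko-2 layers deviation and volatility on top of this, though, which may be where the repeated-game trouble comes in.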
Maybe our Glicko-2 is tuned for rapid rank changes, which favours human goals (like quickly getting to the correct rank), and isn’t great for this repeated-game situation with bots.
OTOH, playing vs a bot to quickly get myself to the right rank is exactly the use case we have.
This is such a weird take. If you just sat and watched a few weak-bot games, you’d realise it doesn’t make sense.
and
I think in the case of Amybot it was something about a neural net not realising a stone or group of stones was in atari.
But the point is that a bot can be a decent DDK, 13 kyu say, and still not see the most basic ataris.
It can be winning all game and then throw the game away by ignoring a massive atari for seemingly no reason.
It’s a similar effect to what you’d get if a human played with a handicap where they flip a coin before each move: heads they have to tenuki, tails they can play whatever they want.
Essentially, you won’t get a stable rank from randomness like that. The bot can beat players regularly because it’s decently strong (it can make it to 13 kyu etc.), but it can also lose to anyone regularly, even 25 kyus.
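To make that concrete, here’s a toy Monte Carlo (Elo updates, all constants invented) of a bot that plays at a fixed strength but randomly throws games:

```python
# Toy model (plain Elo, invented numbers): a bot that mostly plays at a
# fixed strength but randomly throws a game away never settles at one
# rating; the blunders put a permanent wobble in it.
import random

random.seed(1)
K = 32
BLUNDER_PROB = 0.3       # assumed chance of e.g. ignoring a huge atari
TRUE_STRENGTH = 1500.0   # the bot's level when it doesn't blunder

def expected_score(r_a, r_b):
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

bot = 1500.0
trace = []
for _ in range(2000):
    opp = bot + random.uniform(-200, 200)   # roughly matched opponents
    if random.random() < BLUNDER_PROB:
        bot_wins = False                    # throws the game to anyone
    else:
        bot_wins = random.random() < expected_score(TRUE_STRENGTH, opp)
    bot += K * ((1.0 if bot_wins else 0.0) - expected_score(bot, opp))
    trace.append(bot)

tail = trace[500:]
mean = sum(tail) / len(tail)
spread = (sum((x - mean) ** 2 for x in tail) / len(tail)) ** 0.5
print(f"long-run rating: mean {mean:.0f}, std dev {spread:.0f}")
```

The rating settles well below the bot’s blunder-free strength and keeps oscillating around it; no amount of extra games smooths that out.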
Why wouldn’t they be statistically independent if they’re not the exact same game? If the users in question were exploiting the bot for being more or less deterministic (like exploiting a bot that always plays a certain joseki but can’t read the ladder involved), then yes, I would say the results are not independent.
There’s nothing wrong with two players playing each other a lot in order to have their relative ranks settle the correct amount apart, if they’re just playing normally with no exploits.
In a go club, you might play a number of games against someone and do a manual version of what the rating system does: I’ll give you nine stones, ok I lost; I’ll give you eight stones, ok I lost; I’ll give you seven stones, ok I won. The mental adjustment you might make is that maybe you’re slightly weaker than you thought for losing those games, but your rank is stable enough that it’s not going to move much, while for the unranked player it moves several ranks to match the handicap.
There’s nothing inherently wrong with having played the same player multiple times.
Now, could it be a possible improvement to damp certain bot abuses by down-weighting repeated games and games where exploits are being used? Sure, it probably could.
But I don’t think you’ll get stable bots from that change; you’ll just slow down the timescale of the fluctuations. It might go from 10 minutes to a day instead.
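For concreteness, damping could look something like this (a purely hypothetical scheme of my own, not anything OGS implements):

```python
# Hypothetical damping scheme: shrink the K-factor for repeated recent
# games against the same opponent, so a 20-game farming session moves
# ratings far less than 20 independent games would.
from collections import Counter

K = 32
recent = Counter()  # games played vs each opponent in the current window

def damped_k(opponent_id):
    recent[opponent_id] += 1
    n = recent[opponent_id]
    return K / n ** 0.5  # 1st game: 32.0, 4th: 16.0, 16th: 8.0, ...

for game in range(1, 6):
    print(f"game {game} vs the same bot: K = {damped_k('amybot'):.1f}")
```

Even here, the total influence of n repeated games still grows (roughly like the square root of n), so it slows the fluctuations rather than eliminating them, which is exactly the point above.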
Your rating goes up or down depending on your performance: if you score better than your old rating (relative to that of your opponents) suggests, your rating goes up, and if you score worse than that, your rating goes down. Deviation acts as a multiplier on your rating change; having a high deviation means your rating gains and losses will be amplified. Your deviation changes after each set too; this change is driven by your volatility. If your deviation is high compared to your volatility it will go down; if it’s low compared to your volatility it will go up. Finally, your volatility itself is updated by the results of your games. An extreme score such as 5-0 or 1-7 makes it go up, while a score of 3-2 or 2-3 makes it go down.
I think the last sentence here is particularly relevant to the situation with bots playing many games with the same person in a row. Since the bot doesn’t “learn” or “react” to the opponent’s play across games, many people quickly figure out how to beat a bot close to 100% of the time (this also ties in with people reinforcing bad habits when “climbing” by only playing bots). So it seems likely that the volatility of the bot’s rating is high because the results in its rating-update batches tend to be lopsided (people lose all games until they figure out the bot’s weaknesses, then they win all games).
Granted, this assumes that the matches are between two similar ratings. If that’s not the case (i.e. we would expect the results to be somewhat lopsided anyway), then I don’t know how it would affect the ratings under Glicko-2.
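To see why lopsided batches matter, here’s a deliberately simplified sketch of the qualitative relationships described in the quote above. To be clear, these are NOT the real Glicko-2 formulas; every constant is made up purely for illustration:

```python
# Simplified sketch, NOT real Glicko-2: extreme batches (5-0, 0-5) push
# volatility up, deviation tracks volatility, and rating swings scale
# with deviation. Alternating sweeps never let the rating calm down.

def update(rating, deviation, volatility, wins, batch_size):
    expected = batch_size / 2                   # assume evenly matched players
    surprise = abs(wins - expected) / expected  # 0 = even score, 1 = sweep

    # Lopsided batches raise volatility; balanced ones (3-2) lower it.
    volatility = max(0.01, volatility + 0.02 * (surprise - 0.5))

    # Deviation relaxes toward a multiple of volatility.
    deviation += 0.5 * (300 * volatility - deviation)

    # Rating change is proportional to deviation: high deviation, big swings.
    rating += deviation * (wins - expected) / batch_size
    return rating, deviation, volatility

r, rd, vol = 1500.0, 100.0, 0.06
for i, wins in enumerate([0, 0, 5, 0, 5, 0, 5, 5]):  # lopsided 5-game batches
    r, rd, vol = update(r, rd, vol, wins, 5)
    print(f"batch {i}: rating {r:6.1f}  deviation {rd:5.1f}  volatility {vol:.3f}")
```

With nothing but sweeps in its history, volatility keeps climbing and deviation stays propped up, so every batch produces a big rating jump: the “figure out the bot, then farm it” pattern bakes instability into the numbers.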
Now, I don’t think we can expect any rating system to be perfect (in fact I think that trying to represent playing strength by a single number is a big simplification, as I’m sure it’s possible that A beats B, B beats C, and C beats A).
Humans have a “skill” - demonstrably, we perform at a roughly consistent level across different scenarios (opponents, opposing strategies etc.).
In contrast, a bot’s performance will be highly dependent on the opponent and the scenario, wildly so.
“Because they don’t learn” is only one factor (for example, they don’t learn that a ladder is happening, so they are always bad in that scenario). Other factors will include their hard-coded heuristics (especially for simplistic bots).
The net result is that, unlike humans, a bot’s performance varies wildly, so the concept of “a level of skill”, in the same way that it applies to humans, doesn’t apply to bots.
I believe any stable human or bot will show a decent amount of variation under OGS’s rating system. It was designed that way as a tradeoff, so that humans who are rapidly changing or misranked will quickly reach an appropriate rank.
The reason I suggested anchoring bot rank is that we know bots will never get better or worse. amybot will not do a bunch of tsumego and reach Dan. Any variation is a result of the RNG (or exploits, though I think that’s a separate thing).
I might not know what you mean by “anchoring bot rank” actually!
Care to elaborate?
I don’t think this is actually the case.
Our “level” or “rank” is intended to be a predictor of our performance based on our opponent’s rank/level.
When I say “bots don’t have a skill like we do” I mean that their rank, obtained in the same way as ours, is not a predictor like that.
A bot’s outcome is far more scenario-dependent than a human’s (I assert). Therefore rank is not as much of a predictor for them, i.e. they don’t “have” a rank in the same way we “have” one.
I just mean assigning them a rank and keeping it. Instead of letting Amaranthus fluctuate between 20k and 35k, just “anchor” it at 27k (or something in that neighborhood).
And yet, we give them one. The question is whether to let this rank flail about based on who’s currently exploiting it, or hold it steady.
I think we agree on this point. Bots have exploitable blind spots (even more than humans), and that’s not great for the rank system.
I think “ranking system experts” would need to comment on the effect of that - I can guess at some rank-pool-distortion argument, though we’re way out of my bailiwick with that.
Basically, we’ve already seen what happens when you let a 25-kyu bot take 9 stones on 9x9 and still lose: you get players hitting 27-dan ranks.
If you force a bot to maintain a falsely high level and reward players who beat it, you can potentially inflate a lot of users’ ranks.
If you force it to maintain a rank at a falsely low level, you’ll massively deflate the ranks of the users who challenge it and lose.
I would be very careful which bots you choose to anchor, if going down that path.
KGS uses an anchor system for humans, which you’d think would be more sensible, but because of the recalculations the system does to keep those players fixed, it wildly fluctuates everyone else’s levels.
Now, it’s an exaggerated effect on KGS because even inactive players get recalculated.
But if you anchor, say, bot ranks (badly), you will certainly cause wild fluctuations for the users that play them (maybe not so much for everyone else), and in turn those effects spread slowly to the rest of the player base.
So we could run into an inflationary or deflationary effect for certain rank ranges.
I don’t think that this is an issue that should affect the proposal.
That’s because it makes no difference whether the bot’s rank goes down (from 25k) or stays at 25k during this abuse scenario: the abuse still happens. That needs to be fixed no matter what; the value of the stones needs to be correct.
(I can’t comment on the rest of the points, I literally don’t know.)
It’s an example of what happens when you reward someone with points based on an incorrect estimate of how likely the result was.
If I’m an 18 kyu who can beat an incorrectly anchored bot at 14 kyu, because that was the mean of a rank that actually fluctuates between 18 kyu and 10 kyu (e.g. Amybotddk currently), then you will certainly inflate players’ ranks.
It shouldn’t be that likely for an 18-kyu player to regularly beat a 14-kyu player in an even game, especially if 4 stones of difference mean something; certainly not as likely as it’s happening.
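Back-of-the-envelope, in Elo terms (assuming roughly 100 points per rank, so 4 ranks is about 400 points; both numbers are just for illustration):

```python
# Rough worked example (Elo-style, invented numbers): a bot pinned ~4 ranks
# above its real strength pays out points on average to everyone who farms
# it, and since the anchor never earns the points back, the surplus leaks
# into the player pool.
K = 32

def expected_score(r_a, r_b):
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

anchored = 1900.0   # where the bot is pinned (the "14 kyu")
true_str = 1500.0   # what it actually plays like (the "18 kyu")
player = 1500.0     # an 18 kyu who has found the bot's weakness

p_win = expected_score(player, true_str)       # real chance: ~0.50
p_assumed = expected_score(player, anchored)   # what the system expects: ~0.09
print(f"average rating gain per game: {K * (p_win - p_assumed):+.1f}")
# ~ +13 points per game, every game: a steady pump into the player's rank.
```

The system thinks the upset should happen about 9% of the time, so every win pays out nearly full value, and that mispricing repeats for as long as the bot is held at the wrong level.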
Why is it happening so often? Because certain weak bots ignore ataris that no human at that level would ignore.
A 13-kyu bot plays ok, but ignores an atari for no reason.
All you have to do is look at some of the games to realise why some of these bots are unstable.
I don’t think katago-micro would show variation except maybe a small trickle of gaining 1 point or so per game, if it played more ranked games and you could remove any effects of people cheating against it.
So I don’t agree that a stable player will automatically be volatile purely because of Glicko-2.
It’s an amalgam of effects, but I wonder how many people are making assumptions without actually looking at the games of the bots that show such variability.
But but but… this is a problem whether or not the bot’s rank is anchored?
This is just the “bot abuse” problem, isn’t it?
I can see an argument that says “anchoring the bots will accelerate the rate at which bot-abusers can change their rank”, but the problem there is bot-abuse, not anchoring, right?
Hang on, though, what problem is anchoring supposed to solve?
Other than “oh, I got a surprise at the rank of this bot today”?