Bots' rank fluctuations

Problem: the ranking of bots is (a) volatile and (b) poorly correlated to the kyu/dan ranking framework of human players.

A modest proposal: Don’t allow ranked games against bots.

How their “skill” levels are labeled becomes irrelevant.

1 Like

I can’t say whether allowing unstable bots, or bots in general below a certain level, to play ranked is a good thing for the rating system overall.

I don’t know.

All I know is, one reason we even rank bot games is so people can quickly adjust to their level in a different way that isn’t sandbagging or airbagging against humans.

That’s a tangible effect. We often get complaints that people want to exclude ? players from automatch games or from accepting custom challenges.

So the bots serve a purpose: a dan player starting out as a ? player (more so in the past than now) doesn’t have to hammer players from 15 kyu upward before reaching a correct rank, in games neither side will necessarily enjoy.

Or a 25 kyu starting as a ? player doesn’t have to lose 3-20 games just to drop to 25 kyu (depending on whether a strong player resigns when matched with a beginner).


Other than what I mentioned, generally I’m not sure there’s a need to rank bot games for the majority of use cases.

I think some people also want to play a lot of games, beginners or experienced players alike, when there isn’t necessarily the matchmaking to support it. Or maybe they feel a bit stigmatised about wasting a human’s time, or about making mistakes they shouldn’t be making, and the bots don’t really care.

Or you could do as suggested if need be, and only rank a few games between bots and users (although if that can be reset by playing another bot, people will always find out and exploit loopholes).

1 Like

This certainly describes my situation as a newbie to Go. The bots have much appeal and meet a need.

But the value of playing against AI models doesn’t mean their use needs to mess up the human ranking system. That seems to be what’s causing the grief being expressed by the human community.

The human ranking system seems to suffer, not benefit, from the rank effects of bot playing. (Perhaps a separate rank for games against humans vs bots would help?)

1 Like

It’s a fair point. My assumptions are based mostly on the math: if you pit two evenly matched players against each other, you will still get a graph like this:

Note: this is just using a glicko2 python lib. I didn’t use OGS’s constants for this simulation.

source
# glicko_variation.py
# pip install glicko2 matplotlib

import random
import matplotlib.pyplot as plt
from glicko2 import Player

def simulate(games=300, seed=42, draw_prob=0.0):
    random.seed(seed)
    A = Player()  # defaults: rating=1500, rd=350, vol=0.06
    B = Player()

    hist_A, hist_B = [A.getRating()], [B.getRating()]

    for _ in range(games):
        # Random outcome (optionally allow draws)
        r = random.random()
        if r < draw_prob:
            sA, sB = 0.5, 0.5
        else:
            sA, sB = (1.0, 0.0) if random.random() < 0.5 else (0.0, 1.0)

        # --- cache pre-game values for synchronous update ---
        A_R, A_RD = A.getRating(), A.getRd()
        B_R, B_RD = B.getRating(), B.getRd()

        # Update both using pre-game ratings/RDs
        A.update_player([B_R], [B_RD], [sA])
        B.update_player([A_R], [A_RD], [sB])

        hist_A.append(A.getRating())
        hist_B.append(B.getRating())

    return hist_A, hist_B

if __name__ == "__main__":
    A_ratings, B_ratings = simulate(games=300, seed=7, draw_prob=0.0)

    # Plot ratings
    plt.figure()
    plt.plot(A_ratings, label="Player A rating")
    plt.plot(B_ratings, label="Player B rating")
    plt.title("Two evenly matched players (Glicko-2, synchronous)")
    plt.xlabel("Game")
    plt.ylabel("Rating")
    plt.legend()
    plt.tight_layout()
    plt.show()

There are certainly other factors - I mentioned exploits in my earlier posts as well.

To what extent might this be attributed to the log transformation between dan/kyu and glicko? If we assume a similar glicko-scale fluctuation across the spectrum, then dan players will see less variability in their rank than kyu players, right?
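
A rough illustration of that, as a sketch: I’m assuming the conversion I’ve seen cited for OGS, rank = ln(rating / 525) × 23.15 (where 30 is 1 dan), which I haven’t verified against the source.

# rank_swing.py
# How many ranks a fixed +/-50 glicko swing spans at different
# rating levels, assuming rank = ln(rating / 525) * 23.15.
import math

def rating_to_rank(rating):
    return math.log(rating / 525.0) * 23.15

for rating in (700, 1100, 1500, 2200):
    swing = rating_to_rank(rating + 50) - rating_to_rank(rating - 50)
    print(f"rating {rating:4d}: +/-50 rating spans {swing:.2f} ranks")

Since d(rank)/d(rating) = 23.15 / rating, the same 100-point rating band covers roughly three times as many ranks at rating 700 as at 2200.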

3 Likes

That was it basically. The other is “Last week I played Amaranthus and it was 30k, this week I played Amaranthus and it was 20k… felt the same to me :man_shrugging:”

4 Likes

Counteracting rating drift to produce well-defined stable ranks for the whole server. When I reach dan in 100 years it would be nice to know that it’s the same “dan” I remember from before :blush:

Of course this would be clearly spelled out in a detailed proposal, not just something for someone to randomly do on a whim some weekend.

It could be done by locking the bot’s rank completely, or more gently by applying a small bias to keep the rank near its target over time.
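
As a hypothetical sketch of the gentler option (the function name, target, and gain constant are all illustrative, not anything OGS actually has):

# anchor_bias.py
# After each rating period, nudge the anchor bot's published rating
# a fraction of the way back toward its target: strong enough to
# cancel slow drift, weak enough not to fight real game results.
def apply_anchor_bias(rating, target=1200.0, gain=0.02):
    return rating + gain * (target - rating)

r = 1300.0
for period in range(5):
    r = apply_anchor_bias(r)
    print(f"period {period}: {r:.1f}")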

We’d need to pick a bot or bots that we can keep running at exactly the same performance for years, even as the server operating system and hardware get upgraded.

That’s the reason I’ve been experimenting with my “nixbot” line of bots! None of the existing bots provided precise details on their software and configuration.

2 Likes

It’s much more ambitious than what I was thinking, but an awesome goal.

It seems very difficult to get right, given that bot weaknesses are reproducible in a way that is more pronounced than human weaknesses. I’m curious if such a thing is even possible (after all, we’ve seen exploits of even the modern, strong bots!).

3 Likes

I agree, in the same way that when you toss a coin 100 times, you won’t always see an exact 50-50 split of heads and tails even if the coin is perfectly fair.
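
(For a fair coin the number of heads in n flips has standard deviation √(n · p · (1 − p)) = √(100 × 0.5 × 0.5) = 5, so runs like 45-55 are completely ordinary.)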

From your graph I guess we’re seeing maybe a 100-point swing in rating, 1450-1550. In rank terms maybe that’s between 5 and 6.5 kyu. If we pick a random but reasonably stable 5ish-kyu player, I see something like 4.9 ± 0.9 kyu.

[image: rank history of a reasonably stable ~5 kyu player, hovering around 4.9 ± 0.9 kyu]

But that’s a very different scale from what’s happening to bots like amybot-ddk.

[image: rank history of amybot-ddk, with much larger swings]

I believe the volatility parameter is capturing this information even more than the standard deviation (which is already much bigger). The volatility isn’t usually displayed, but you can see it in the termination API or in the rating calculator.

Without giving away too much about who the player is, you can see the comparison. 0.06 or less is probably normal for ordinary stable players, while some of the really volatile bots literally cap out at 0.15.

I think the standard deviation is what takes this variability into account. The Glicko system doesn’t really assume everyone has a rating precise to 1 point; it allows some variability and volatility in ratings.

Sure, the smaller the rank bands, the more you might fluctuate between them. But we should still expect the same rating fluctuation, no? Is Glicko translationally invariant? Does it matter if we started everyone’s rating at 3000 and went up and down from there?
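
One way to check, reusing the glicko2 pip package from the simulation upthread (so its constants, not OGS’s): replay an identical win/loss sequence from two different starting ratings and compare.

# translation_check.py
# If Glicko-2 is translation invariant, the two runs below should
# differ by exactly the 1500-point offset (up to float noise).
import random
from glicko2 import Player

def final_rating(start, games=200, seed=7):
    random.seed(seed)  # same outcome sequence for every start value
    A, B = Player(rating=start), Player(rating=start)
    for _ in range(games):
        sA = 1.0 if random.random() < 0.5 else 0.0
        a_r, a_rd = A.getRating(), A.getRd()
        b_r, b_rd = B.getRating(), B.getRd()
        A.update_player([b_r], [b_rd], [sA])
        B.update_player([a_r], [a_rd], [1.0 - sA])
    return A.getRating()

print(final_rating(1500), final_rating(3000))  # expect x and x + 1500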

2 Likes

I believe this is the case (ergo, dan/kyu rank is not translationally invariant)

The problem is people learning to exploit them, right?

I think you just don’t let the anchor bots play that much. Only a few people will know the weaknesses and they aren’t going to sit around waiting for a chance to have a single ranked game. Even one game a day would give you a good sense for how our ranks are drifting over months.

2 Likes

Even if the coin and table are both stable over time, the analogy breaks down if you keep tilting the table in one direction or another. That seems to be more akin to having a number of human players come and go with different skill levels that are also changing over time.

The challenge remains: trying to incorporate a Go-playing algorithm into a human-based ranking system. They’re fundamentally different beasts.

Perhaps best to let the bots play each other and have their own ranking system. And the same for the humans?

1 Like

Well, the reason the coin-flip analogy is quite apt in response to @benjito is that if you assume two people have the same rating, and that’s both their true rating, then the outcome should be 50-50. Player A winning will increase their rating, and B losing will decrease theirs.

You can imagine the difference in rating when A plays B as similar to the difference between heads and tails in coin outcomes: it’s fluctuating up and down, positive and negative, all the time. It’s very rarely perfectly steady and stable.

Yes, real players and real ratings have many other players involved and it’s much messier, and we often have people involved who aren’t ranked at or near their true rating.

We’ve incorporated them fine; you can rate them exactly the same way you rate humans, but their stability in such a system is the issue.

The issue brought up is whether that’s a feature of glicko2 (the current system) or whether it’s inherent in the bots play, their own instability coming from whatever sort of algorithm they’re using.

3 Likes

Right, I think the coin flip analogy is a good one, since that’s how I coded the simulation:

sA, sB = (1.0, 0.0) if random.random() < 0.5 else (0.0, 1.0)

@SomeGoGuy The purpose of the simulation was to show that we can remove external factors such as bot exploits and manipulation, and still have a lower bound on the fluctuations - it’s an artifact of Glicko (and probably of any statistical rating system).

We could certainly introduce uneven matchups if it would be useful, but I wanted to keep the simulation as simple as possible.

3 Likes

I don’t think it can be fully explained as inherent to bots. The Glicko 2 paper recommends reporting rating as a 95% confidence interval of ± twice the rating deviation (RD). I don’t know if OGS reports ratings as ± 2 standard deviations as suggested, or uses ± 1 standard deviation, but either way, the intervals on some bot accounts look much too small to me if they’re supposed to represent 68 or 95% confidence intervals.

I don’t know where the issue lies, but looking up bots and humans in the rating calculator, I note that volatility seems to be constrained to between 0.06 and 0.15. Humans tend to be on the low end and weaker bots on the high end? Deviation (for active players) seems to run between roughly 60 and 125, though unlike volatility I didn’t see any suspiciously round numbers.

My intuition is that there’s something off there, but I haven’t yet wrapped my head around the math well enough to guess at what. The main knobs to turn seem to be the “system constant” τ, and the determination of what constitutes a “rating period”. I wonder if maybe Glicko 2 doesn’t work so well when you have some players playing hundreds of games a day and other players playing maybe one? And if I understand what I’ve read about Glicko 2 correctly, larger deviation values would make the rating swings more pronounced, not less, which is the opposite of what we want. I’ve also read deviation explained as supposed to get smaller the more games someone plays and larger the more “rating periods” they don’t play, which breaks from the idea that deviation corresponds to standard deviation.
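
For reference, if I’m reading step 6 of the paper right, the pre-period deviation is inflated by the volatility as φ* = √(φ² + σ²), so after t idle rating periods the deviation grows to roughly √(φ² + t·σ²). That’s consistent with deviation being an uncertainty about the rating rather than a standard deviation of results, and it would mean a volatility pegged at 0.15 also makes deviation grow back faster between games.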

Part of me just wants to say “calculate the mean and standard deviation of a bot’s Glicko 2 rating over the last 5000 games and be done with it”.
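
Something like this, where actually pulling the rating history out of the database is the part I’m hand-waving:

# bot_rating_summary.py
# The "just average it" idea: summarize a bot's recent ratings
# directly. Obtaining `ratings` (newest last) is left out.
from statistics import mean, stdev

def summarize(ratings, window=5000):
    recent = ratings[-window:]
    return mean(recent), stdev(recent)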

4 Likes

Great point.

I still (pure intuition) think “it’s because bots don’t have the inherent quantity we are trying to measure”.

The idea we have of human skill, as measured by rank, rests on the assumption that humans will perform roughly the same independent of scenario (my assertion), and will correct towards that value in new scenarios.

This is not the case with bots.

I think that you “can’t” do this with bots, because they don’t behave in the same way.

I feel less sure about this for modern bots: they might “behave as if” they have a skill level similar to humans, as long as they aren’t provoked with special-case known holes like ladders.

I think the first thing you would have to do is establish whether this rank predicts the outcomes vs humans of various other ranks in the way we expect. If my intuition is true, it would be bad at it.


On the other hand, I’ve wondered the same question about humans, to a lesser extent: we all have RDs that appear smaller than the variation of outcomes, at a purely “eyeballing” level.

I think I might have found an explanation while reading about other systems. From here: An Elo-like System for Massive Multiplayer Competitions. The bit that stuck out to me:

While it may seem like a good idea to boost changes for players whose ratings are poor predictors of their performance, this feature has an exploit. By intentionally performing at a weaker level, a player can amplify future increases to an extent that more than compensates for the immediate hit to their rating. A player may even “farm” volatility by alternating between very strong and very weak performances. After acquiring a sufficiently high volatility score, the strategic player exerts their honest maximum performance over a series of contests. The amplification eventually results in a rating that exceeds what would have been obtained via honest play.

I don’t think bots are gaming the system, but rather I suspect they might get stuck in a volatility loop. High volatility → large rating shift → high volatility, and so on. Why that would happen to bots specifically is a separate question, as is whether it actually is just bots.
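
Here’s a sketch of how one might poke at that loop, again with the glicko2 pip package from upthread (its default τ, not OGS’s constants; the ±400 opponents are just a blunt way to force surprising results, and vol is a plain attribute in that implementation):

# volatility_loop.py
# Feed a player alternating upsets - a win over a much stronger
# opponent, then a loss to a much weaker one - and watch how the
# volatility responds.
from glicko2 import Player

p = Player()  # defaults: rating=1500, rd=350, vol=0.06
for game in range(1, 201):
    if game % 2:  # upset win against a stronger opponent
        p.update_player([p.getRating() + 400], [60], [1.0])
    else:         # upset loss against a weaker opponent
        p.update_player([p.getRating() - 400], [60], [0.0])
    if game % 50 == 0:
        print(f"game {game}: rating={p.getRating():.0f} "
              f"rd={p.getRd():.0f} vol={p.vol:.3f}")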

4 Likes

I don’t understand the quoted conclusion.

It seems that a player attempting this strategy might momentarily ( * ) achieve a higher rating, but then they’d do what we see bots doing: collapse back down lower. I don’t see how you can “farm volatility” to maintain a higher rating?

But … do we see high volatilities? I personally haven’t, though I haven’t searched systematically … though I have been curious and looking from time to time.


( * ) in the English sense of the word :stuck_out_tongue: :squinting_face_with_tongue:

1 Like

I probably should have ended the quote earlier. The part that caught my attention was “A player may even ‘farm’ volatility by alternating between very strong and very weak performances.” I suspect bots might be getting stuck in that state unintentionally.

As for volatility, some examples (forgive the formatting, this is lazy copying from API responses)

Amaranthus:
      "rating": 654.3335047147162,
      "deviation": 106.13085933437775,
      "volatility": 0.14985121352502478
amybot-beginner:
      "rating": 702.9264869436303,
      "deviation": 120.30072207472728,
      "volatility": 0.14956478949217464
Agapanthus:
      "rating": 683.5510487010632,
      "deviation": 124.64858866549983,
      "volatility": 0.14992895438082404
Bergamot:
      "rating": 1193.7818539780046,
      "deviation": 125.35935900289866,
      "volatility": 0.14889168227845906
Bouvardia:
      "rating": 947.4232850715038,
      "deviation": 105.5460271608031,
      "volatility": 0.1477592289088039
GnuGo:
      "rating": 1121.6080666015412,
      "deviation": 102.37309600063944,
      "volatility": 0.1498847340390198
amybot-ddk:
      "rating": 1214.876649476943,
      "deviation": 102.38469307938327,
      "volatility": 0.14983984938287886
noob_bot_3:
      "rating": 1157.8434900953864,
      "deviation": 108.84725612333311,
      "volatility": 0.14944262392954683
gnugo-nixbot:
      "rating": 1161.3397078895166,
      "deviation": 67.29046961715564,
      "volatility": 0.06969841932492986
DangoApp:
      "rating": 1767.956549896383,
      "deviation": 89.83800466809697,
      "volatility": 0.05999973671298351
Carnation:
      "rating": 1036.2448819591905,
      "deviation": 102.31072582122572,
      "volatility": 0.14995001037040748
noob_bot_2:
      "rating": 1234.5564007400521,
      "deviation": 81.13759744484666,
      "volatility": 0.09804240715906964
Fuego:
      "rating": 1337.6866550899074,
      "deviation": 78.58105233021269,
      "volatility": 0.07400563359272345
fuego-nixbot:
      "rating": 1314.4244083848046,
      "deviation": 66.84123609141945,
      "volatility": 0.062262770338260436
noob_bot_1:
      "rating": 1500.6746287879837,
      "deviation": 63.48835496572388,
      "volatility": 0.05834678165542589
amybot-sdk:
      "rating": 1667.8994139181582,
      "deviation": 83.13775408491934,
      "volatility": 0.08087270747871336

These numbers are weird. I find it suspicious how many bots are just shy of 0.15, and that there’s an apparent difficulty cutoff where that stops happening. But I think more detailed analysis should be left to someone who can query the database directly. One thing I can’t do with the REST API is see how many humans appear to have their volatility pegged, for example. I’d also be curious to see what the distributions of deviation and volatility look like across all users, and whether there are any interesting patterns in the relationships between rating, deviation, volatility, number of ranked games played per day, and so on.

3 Likes

I can believe that some of the bots are so weird that playing with them isn’t really like playing Go. For example, you could make a weak bot by starting with KataGo and adding occasional random moves. But a game against that bot would be more like playing a strange Go variant, and I can believe that its rank would depend mostly on how much the current crop of players is into Go variants, more than on anyone’s actual skill at Go. (Note: this is measurable in principle, by comparing multiple bots over time.)

To take it to an extreme, I wouldn’t expect to get a great ranking system by anchoring to our players’ performance playing chess against a chess bot.

So I agree it could easily be problematic.

Among our bots I think I’ve only played seriously against GnuGo - just a couple of times - and I don’t find it that weird. I played a game against it just now and it felt like a somewhat distracted 10k who sometimes manages to pay attention and show great reading skill. If I encountered it in a normal pairing I probably wouldn’t suspect that anything was unusual about the game.

I’m sure I’d notice more weird behavior if I played it more. Not sure how much that matters.

1 Like

Is the code any help?