Bot's rank fluctuations

I don’t know what you mean by suspiciously round numbers.

The volatility is capped at 0.15, you can look at the code if you want.

The rating calculator rounds things rather than show you 15 places of volatility. You can use the termination API to see the full number like https://online-go.com/termination-api/player/427361

"overall":{"rating":1820.8870837803515,"deviation":63.26375467851684,"volatility":0.05999128826204747}

Lots of things are rounded for viewing because nobody cares about that many decimal places.

Are those dependent or interacting with each other somehow? What if those two types of players never played each other? It could be true that the bots are hitting one extreme and correspondence players are hitting another.

I don’t think there is an update period though; there used to be one with a rolling window of 15 games, but it had some unintuitive issues. I’m also not sure whether updating the deviation with inactivity is in production. I think it was added to the repository, but I’m not sure it has been deployed yet. So maybe someone can answer that?

Isn’t it more or less what a standard deviation is? It’s an uncertainty in your play (which would show up as a spread of ratings over the course of a number of games). Your rating isn’t assumed to be a single number, but rather a distribution with a mean and standard deviation.

I do want to read through Glickman’s papers: the Glicko paper, the Glicko-2 paper (which reads more like a guide to implementing the algorithm), and his more technical paper, but my statistics knowledge probably needs to be expanded a bit to understand them.

1 Like

This comment got long, so I’ll start with this: No, I still haven’t arrived at any concrete answers or explanations :sweat_smile: Except that I think I’ve figured out why rating intervals seem off at a glance (explained below).
After some more reading and thinking, I’m mostly just more convinced that I’d need to do some serious data analysis to make progress.

A lot, yeah.

That seems to be the case empirically, but I searched the code base on GitHub for “MAX_VOLATILITY” and got no results but that line. So I’m not sure yet why bots seem to cap out at 0.15.

The repo has five different implementations of class DailyWindows, ranging from per-game to weekly rating periods. Obviously some per-game calculation is happening, because ratings on OGS update after every game. I’m trying not to simply assume there isn’t a more complicated, time-window-based calculation going on behind the scenes, because I have read that OGS has used longer-term corrections in the past, and I haven’t reviewed enough of the code to rule it out. If per-game calculation is all that’s being done, that could be the issue: Glicko-2’s algorithm is intended to be run on batches of games, which might help smooth things out. Maybe. I’d have to poke around at the math and/or code to be sure.

More or less, yes, but I don’t understand the relationship between Glicko or Glicko 2 calculated deviation and standard deviation as the Glicko and Glicko 2 papers don’t explain the logic behind the math. ± 1 standard deviation should give a confidence interval of 68% (rounded). I think the answer is simply that Glicko deviation isn’t standard deviation? Glicko and Glicko 2’s deviation calculations both depend on constants which have to be chosen by the implementer, and the Glicko paper outright recommends setting a lower bound on rating deviation. All of which is to say that OGS’s reported confidence intervals represent an unknown amount of confidence, so eyeballing rating history charts can’t tell us if there’s a calculation error. Bots having frequent rating swings outside their reported confidence intervals might be a feature, not a bug. But also, it might be possible to tune the constants to get a better approximation of standard deviation.

Finally, a pragmatic consideration: I’m not sure there’s anything that can be done to smooth out the big swings which wouldn’t either have undesirable side effects for users (e.g. ratings not updating every game) or require a substantial code change (e.g. conditional handling for bots). Though as purely a UX change, maybe the bot list could include a rolling average rank for each bot, in addition to its current rank?

4 Likes

[unnecessary comment retracted, apologies]

Glickman has a more technical paper discussing the ideas here

From the implementation paper, which is confusingly similarly named:

The Glicko system therefore extends the Elo system by computing not only a rating, which can be thought of as a “best guess” of one’s playing strength, but also a “ratings deviation” (RD) or, in statistical terminology, a standard deviation, which measures the uncertainty in a rating (high RD’s correspond to unreliable ratings). A high RD indicates that a player may not be competing frequently or that a player has only competed in a small number of tournament games. A low RD indicates that a player competes frequently.

Maybe I don’t fully understand what you’re describing as this human quality, but it sounds like the notion that human players will learn and adapt to new scenarios (I assume this means e.g. encountering a new playstyle?).

Personally, I think if a player encounters a new environment they struggle with and eventually learns how to cope with it (as I think you’re describing), then intuitively that player has increased their rank, not just held it steady.

Conversely, I think there’s plenty of players who might just play at a lower level against specific styles (maybe I struggle against fighty opponents), and higher level against others (maybe I’m good at countering territorial play). Seems reasonable that there’s multiple paths to a given rank. In that regard it seems reasonable too that a bot, or indeed a more rigid human player that fails or refuses to learn new skills, can still hold a steady rank.

3 Likes

@shinuito:

That seems to be the case empirically, but I searched the code base
on GitHub for “MAX_VOLATILITY” and got no results but that line.
So I’m not sure yet why bots seem to cap out at 0.15.

emphasis added

1 Like

Yes I’m an idiot here. I didn’t read carefully enough myself to understand where/what the issue was.

It looks like the 0.15 that is used is hardcoded in, rather than referring to a max_volatility constant.

       volatility=min(0.15, max(0.01, new_volatility)),

The 0.15 and 0.01 could be replaced with the min_volatility and max_volatility from before.
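For illustration, a sketch of what that clamp could look like with named constants instead of magic numbers (the constant names here are assumed for the example, not copied from the repo):

```python
# Assumed names matching the hardcoded values in the clamp above
MIN_VOLATILITY = 0.01
MAX_VOLATILITY = 0.15

def clamp_volatility(new_volatility: float) -> float:
    """Clamp a freshly computed volatility into [MIN_VOLATILITY, MAX_VOLATILITY]."""
    return min(MAX_VOLATILITY, max(MIN_VOLATILITY, new_volatility))

print(clamp_volatility(0.20))   # 0.15, capped at the maximum
print(clamp_volatility(0.005))  # 0.01, floored at the minimum
```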

Wait … um…

What’s the TL;DR of “why we have a max volatility”?

It sounds like “the cause of the problems”?

Maybe, but if anything, the volatility being at 0.15 is already a “problem”, in that it’s contributing to faster rating changes. It’s one of the key differences between ordinary players and bots.

I think one of the main differences between glicko and glicko2 is the introduction of volatility.

The volatility measure indicates the degree of expected fluctuation in a player’s rating. The volatility measure is high when a player has erratic performances (e.g., when the player has had exceptionally strong results after a period of stability), and the volatility measure is low when the player performs at a consistent level.

The volatility is doing what it’s supposed to do: it’s telling us that the bots are having erratic performances. I’ve shown examples of why that is for Amybot, where it’s winning and then ignores big ataris seemingly randomly, and so can lose to almost anyone.

I think practically it’s used to modify the deviation (increase it? there are a lot of functional transformations), which I presume then allows players to change rank faster because of their erratic performance.

I’m not sure what the normal circumstance of high volatility would be (if we take the bots’ case as exceptional, though some are generally volatile). Would it be a player that took a break from rated play and came back stronger, so they have a lot of unexpected results?

Or would it be more like a player that plays lots of blitz games, and the skill and results in blitz could be a bit more erratic because low time and intuition plays can lead to more mistakes that flip the result often.
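To make concrete how the volatility feeds into the deviation and rating updates, here’s a sketch of the Glicko-2 rating-period update as I transcribed it from Glickman’s implementation note (so not the OGS code, and any bugs are mine):

```python
import math

SCALE = 173.7178  # conversion between the Glicko and Glicko-2 scales

def g(phi):
    # Dampening factor: the more uncertain the opponent, the less a result counts
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)

def expected(mu, mu_j, phi_j):
    # Expected score against opponent j on the Glicko-2 scale
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def glicko2_update(rating, rd, sigma, games, tau=0.5, eps=1e-6):
    """One rating-period update; games is a list of (opp_rating, opp_rd, score)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    v_inv, delta_sum = 0.0, 0.0
    for r_j, rd_j, s in games:
        mu_j, phi_j = (r_j - 1500.0) / SCALE, rd_j / SCALE
        e = expected(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1.0 - e)
        delta_sum += g(phi_j) * (s - e)
    v = 1.0 / v_inv        # estimated variance from the game outcomes
    delta = v * delta_sum  # estimated rating improvement

    # Solve for the new volatility with the Illinois root-finding method
    a = math.log(sigma * sigma)
    def f(x):
        ex = math.exp(x)
        num = ex * (delta * delta - phi * phi - v - ex)
        den = 2.0 * (phi * phi + v + ex) ** 2
        return num / den - (x - a) / (tau * tau)
    A = a
    if delta * delta > phi * phi + v:
        B = math.log(delta * delta - phi * phi - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    sigma_new = math.exp(A / 2.0)

    # Here is the interaction: the new volatility inflates the deviation
    # before the information from the games shrinks it again
    phi_star = math.sqrt(phi * phi + sigma_new * sigma_new)
    phi_new = 1.0 / math.sqrt(1.0 / (phi_star * phi_star) + 1.0 / v)
    mu_new = mu + phi_new * phi_new * delta_sum
    return 1500.0 + SCALE * mu_new, SCALE * phi_new, sigma_new
```

Running the worked example from the paper (a 1500-rated player with RD 200 and volatility 0.06 who beats a 1400/30 opponent and loses to 1550/100 and 1700/300) gives back the paper’s results of roughly 1464.06, 151.52 and 0.05999.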


Small sidenote as @Cthulhufish was saying before

Directly underneath the previous quote is

As with the original Glicko system, it is usually informative to summarize a player’s strength in the form of an interval (rather than merely report a rating). One way to do this is to report a 95% confidence interval. The lowest value in the interval is the player’s rating minus twice the RD, and the highest value is the player’s rating plus twice the RD. So, for example, if a player’s rating is 1850 and the RD is 50, the interval would go from 1750 to 1950. We would then say that we’re 95% confident that the player’s actual strength is between 1750 and 1950. When a player has a low RD, the interval would be narrow, so that we would be 95% confident about a player’s strength being in a small interval of values. The volatility measure does not appear in the calculation of this interval.

I think we generally show rating ± deviation as opposed to rating ± 2*deviation.

I guess we’re showing the 68% confidence interval (I think Glicko makes a fair few assumptions about normal distributions).

If we did double everyone’s displayed interval, it just changes how we interpret it. Rather than ±60 rating points being standard enough, we’d instead think ±120 rating points is standard enough for a stable player.
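Putting numbers on that, using the example values from the quote above (rating 1850, RD 50):

```python
rating, rd = 1850, 50  # example values from the Glicko-2 paper

# what a one-deviation display corresponds to: rating +/- RD (~68%)
interval_68 = (rating - rd, rating + rd)

# what the paper recommends reporting: rating +/- 2*RD (~95%)
interval_95 = (rating - 2 * rd, rating + 2 * rd)

print(interval_68)  # (1800, 1900)
print(interval_95)  # (1750, 1950)
```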

Ah true! Fast rating changes, and ironically slow rating changes for users trying to rank with it as a result!

Shouldn’t it do the opposite? If volatility is high, rating changes should be smaller, because very strong or weak performances are expected and do not hint at a player having actually improved or weakened. Or am I misunderstanding something?

It’s amazing how popular glicko has become with so little documentation :joy:

6 Likes

I don’t know. At least from that quote, it sounds like you don’t want to tweak people’s levels much when they’re playing consistently (the deviation tends to go down overall for players over time). If they’ve taken a break or undergone rapid improvement, you should allow them to get up to their proper rating faster.

I kind of assume for normal players in long enough time settings it would just give them a kick to improve rating quickly. Maybe we can do some tests on beta, or do some small simulations.

But I think bots are strange. I can imagine blitz and rapid being normal places where you expect people to play in a somewhat volatile way: some days you’re doing well, and some days you’re in a bad mindset or on a losing streak (on tilt) and play worse. The next day you come back, and the quick adjustments are probably there to make sure you matchmake well at your current level. But there could be a case, as you’re saying, for diminishing fluctuations like that.

But bots (weak bots in particular) are playing very erratically over long periods of time. There can be stability where they beat who you’d expect, but then erratic results based on whatever specifics are driving them to be weak in the first place (ignoring ataris, picking bad moves intentionally/randomly, etc.).

Assume a bot that is pretty stable. Before each game it randomly decides with a fair coin toss whether it will play this game as a stable 5k or as a stable 15k. What are we expecting its rating, deviation and volatility to be?

2 Likes

Probably what we see for the unstable bots: the rating depends on when you stop it playing, the deviation probably 100+, the volatility much closer to 0.15.

But to be fair, it depends on its opponents (the calculations depend on your opponents’ data too). For instance, if it only played 5 kyus, maybe it would just deflate a bit because it’s not always playing as a 5 kyu, so maybe it drops to 6 or 7 kyu; maybe the deviation is a bit higher than normal, like in the 80s, and the volatility a bit higher, like 0.07 or 0.08, though I don’t know exactly. (I’ve lost games to people much more than 9 ranks apart and it doesn’t really affect the volatility when it’s only one game and infrequent.)

Now if it played a whole range of opponents from 20 kyu to 2 kyu, it could cause much more fluctuation and mess, when it’s rated as a 5 kyu but decides to play as a 15 kyu and loses half the time to people up to 10 ranks lower.

1 Like

Good point considering opponents.

I’d expect it to settle at a pretty stable 10k with a high volatility. I’m not so sure about the deviation, I’d assume it tells how much the rating system knows about a player and it knows a lot about the fictional bot, so I’d expect it to be low.

1 Like

So I have the goratings python code from the repository I linked.

I think it’s more or less working in my tests, though maybe there are still some typos to fix.

I want to sketch a few caveats and assumptions and then some test cases to see if things are behaving as expected.

Basically, I don’t think I’m using the right formulae for winning probabilities when I simulate the results of games, which feels kind of important for realistic results. Hopefully it’s not too far off, but it could be another source of uncontrolled randomness that one might want to eliminate if possible.

Caveats:
  • I’m using a formula from the Glicko (not Glicko-2) paper for the expected outcome, not the one in the repository. See for example Basic rank maths questions - #26 by shinuito and Expected winrates based on Glicko2 ratings - #4 by shinuito for the discussion of why we’d expect the expected outcome to depend on both players’ ratings and both players’ deviations.
  • I didn’t remember to try converting those formulas from Glicko to Glicko-2. The Glicko-2 paper mentions how to do the conversion, so there might be some extra constants needed.

The rating scale for Glicko-2 is different from that of the original Glicko system. However, it is easy to go back and forth between the two scales.

  • With that, I used the expected outcome to simulate the results, when I think I should’ve been using the probability formula. I only noticed that detail after making the graphs below. The formulae look similar, but differ in that there’s an additional square root involved in combining the two players’ deviations.
  • I feel like I need to read the more technical paper to properly understand why these two things are different in the first place, unlike with Elo. But probably I should go back and make the above changes first.
Assumptions:
  • The player whose rating we track plays at a 5 kyu level or a 15 kyu level each game, regardless of their current rating. That means the chance of winning only depends on the level they’re supposed to be playing at, not the rating they’re currently at.
  • Bearing in mind the caveats about winning probabilities being a bit off in the simulation, hopefully it still works well enough to capture some of the details we see.
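For reference, here’s a sketch of the two Glicko-1-scale formulas being contrasted, as I understand them from the Glicko paper (my own transcription, so treat with suspicion): the expected score against an opponent uses only the opponent’s deviation, while the probability that one player actually beats the other combines both deviations under a square root.

```python
import math

Q = math.log(10) / 400.0  # Glicko's q constant

def g(rd):
    # Shrinks the effective rating difference when the deviation is large
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)

def expected_score(r1, r2, rd2):
    """Expected score of player 1 vs player 2; only the opponent's RD enters."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd2) * (r1 - r2) / 400.0))

def win_probability(r1, rd1, r2, rd2):
    """Chance that player 1 beats player 2 when both ratings are uncertain;
    both deviations are combined under the extra square root."""
    combined = math.sqrt(rd1 * rd1 + rd2 * rd2)
    return 1.0 / (1.0 + 10.0 ** (-g(combined) * (r1 - r2) / 400.0))
```

With two stable 5 kyus (1550, RD 60) both formulas give exactly 0.5, while a stable 5 kyu against a stable 15 kyu (1050, RD 60) comes out above 90% either way.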

Tests

Test case 1 - stable 5kyu (1550) only plays other stable 5kyus

Our player is 1550 with deviation 60 and volatility 0.06 (I would say stable). The values update after every game.

Opponent is always 1550, 60, 0.06.

The probability of winning here, or the expected outcome (between 1 = win and 0 = loss), depends only on the original rank, not the current rank, since we’re assuming the player truly is at that rank (with those initial parameters).

I think this is more or less what I see after a few runs. Sometimes the deviation goes up a bit in 300 games, once or a few times; I think that’s more or less when the player is on a winning or losing streak, but it settles as the player returns to their stable rating (1550), or thereabouts. The volatility doesn’t really change a whole lot, even with losing or winning streaks. The expected outcome basically tracks the rating of the player, because they always play an opponent with a fixed rating of 1550, so the winrate is more or less tied to the rating difference.

Test case 2 stable 5kyu vs random stable opponents

Our player is 1550 with deviation 60 and volatility 0.06 (I would say stable). The values update after every game.

Opponent is a random rating 1550+N(0,500), 60, 0.06.

The player plays a lot of randomly rated opponents, so there’s always a chance for upsets. The rating fluctuates a bit, and the deviation is a bit higher than normal, but the volatility doesn’t really increase.

I’ve shown the expected outcome for the current rating in red, and the theoretical one (because we’re assuming the player really is 1550) in blue. Sometimes they should have a higher chance of winning than would be predicted by their current rating (underrated); other times it should be lower (if they’re overrated).

Anyway it’s not too crazy.

Test case 3 Player is 50-50 a 15kyu and a 5kyu vs only stable 5kyus

Our player starts at 1550 with deviation 60 and volatility 0.06. They flip a coin as to whether they will play as stable 5kyu or a stable 15kyu (1050,60,0.06), and this is used to calculate the expected chance of winning. Their values update after every game.

Opponent is always a stable 5kyu (1550, 60, 0.06)

It’s partly what I expected: the rating deflates and the deviation goes up, but not by as much as I would’ve thought. The volatility is also not really what I expected.

You can see the expected outcome in theory flip between a 50% chance of winning (playing as a 5 kyu against a 5 kyu) and a basically 0% chance of winning (a 15 kyu vs a 5 kyu).

Another example of the same

Seems fairly similar.

And yeah, it definitely seems to be trending downward toward 9-10 kyu.

Test case 4 Player is 50-50 a 15kyu and a 5kyu vs random stable opponents

Our player starts at 1550 with deviation 60 and volatility 0.06. They flip a coin as to whether they will play as stable 5kyu or a stable 15kyu (1050,60,0.06), and this is used to calculate the expected chance of winning. Their values update after every game.

Opponent is a random a stable player (1550 +N(0,500) , 60, 0.06)

This is a bit more like what I was saying about random opponents: there’s more chance of bigger upsets given the gap between the current rank and the playing skill. The deviation is heading toward 80, and the volatility is actually increasing now.

Another example

I did a couple of runs for more games, but there’s no sudden spike to 0.15 volatility; rather, it’s just a fairly slow increase in volatility over thousands of games. Something like

I did a couple more runs starting at 1550 but playing as, say, a 5 kyu and a 25 kyu, or a 15 kyu and a 25 kyu, and shifted the opponents to be centered more around a 25 kyu level. But again, more of a slow increase in volatility, with the deviation maybe capping out at 80 or so.

So it might be a very long term effect that the bots have just kept increasing in volatility over time. (I say long term for ordinary human timescales, but the bots probably play thousands of games a day :slight_smile: )

5 Likes

From the Glicko-2 paper:

The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period.

I think the idea is that an inconsistent player’s higher volatility makes their rating more sensitive to general changes in their play, while their game-to-game inconsistency is averaged out over the rating period, and their opponents are less rewarded/punished for upsets against the player who is known to be inconsistent.

2 Likes

I guess the idea of a rating period is a bit tricky to define when you have some correspondence players finishing maybe a few games per month, some players playing 10 games a day, and then the extreme of bots finishing something like 1000 games a day.

Like one rating period that would cover all player types for one rating system.

The reason we have one overall rating that is used for matchmaking, though, is in theory because it’s as good a predictor of game winners as any of the individual rankings.


Probably people also expect or have come to expect their rating to update after each game. If there’s delays like the server being backed up we typically get reports about it.

So I suppose there are both theoretical and practical aspects to the decisions about what to do with the ratings system.


But maybe there is something that could get averaged out if you divide up the day into chunks and process rating changes that way. Maybe it could smooth things out a bit?
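As a purely hypothetical sketch of what that chunking could look like (none of this reflects how OGS actually works): collect finished games into fixed time windows, then feed each window to the rating algorithm as one batched rating period.

```python
from collections import defaultdict

def bucket_games(finished_games, window_seconds=6 * 3600):
    """Group (timestamp, game) pairs into fixed-size time windows.

    Each bucket could then be processed as one Glicko-2 rating period,
    instead of updating ratings after every single game.
    """
    buckets = defaultdict(list)
    for timestamp, game in finished_games:
        buckets[timestamp // window_seconds].append(game)
    return dict(buckets)

# Three games: two in the first 6-hour window, one in the second
windows = bucket_games([(0, "game_a"), (100, "game_b"), (7 * 3600, "game_c")])
print(windows)  # {0: ['game_a', 'game_b'], 1: ['game_c']}
```

A bot finishing a thousand games a day would then get four updates of roughly 250 games each, while a correspondence player’s occasional game still lands in some window.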

1 Like

That linear increase from .06 to .064 over 10000 games looks a bit concerning… is it possible that volatility just increases gradually over many thousands of games until it hits the cap at .15? It looks to be locally behaving ok in the short simulations, but could there be some rounding error, math mistake in glicko-2, etc. that systematically drives up volatility over many games?

I’m also going off of intuition, but I would not expect this to be correct. If I were designing a rating system, I would include some sort of volatility to quickly readjust incorrect ratings. Players can end up with incorrect ratings (just registered, legitimate strength increase, sandbagging). You can watch Hikaru do “speedruns” on chess.com. He destroys a bunch of sub-2000 players and gains ~20 rating points each time. It sure seems like the system could be doing something more binary-search-like. A long string of wins or losses should cause the rating algorithm to decrease its confidence in its estimate and adjust that player’s rating faster. If glicko works like this, I wouldn’t expect inconsistent bot play to cause a problem by itself. With enough samples you can get a pretty good estimate of a consistent mean performance. I would expect farming (or even non-malicious practice), with strings of losses, then wins, to create even more severe rating changes in a system that tracks some sort of volatility, like @Feijoa suggested.

2 Likes

It could be, or it could be that it’s doing what it’s supposed to do?

In that simulation it’s because the player is “volatile”: they have a chance to lose to players way below their actual level, like the site bots do, because they randomly play worse than their usual level.

Edit:

You can assign them a mean rating if you want to, but the performance is inherently volatile :slight_smile:

I would like to hear from people that understand or want to dig into glicko and glicko-2 specifically, as to what should be expected, especially in extreme cases.

I suppose I could try a similar simulation with the library @benjito was using and see what happens.