Testing the Volatility: Summary

It tends to be caused by this, but the nature of the bias-variance tradeoff is that it’s a consideration whenever you add any dimensionality at all. For example, adding a quadratic term to what was previously a linear best-fit analysis creates a higher-variance scenario: the quadratic model may overfit to the data that has been explored. A linear regression, on the other hand, might fail to capture the true relationship on either the training data or the test data, in which case it would be “biased”.
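To make the linear-vs-quadratic point concrete, here is a minimal sketch (all data is made up; only the shape of the comparison matters): a higher-degree polynomial always fits the training data at least as well as a line, but that says nothing about how it does on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a truly linear relationship plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test + rng.normal(0, 0.3, 50)

def fit_and_score(degree):
    """Fit a polynomial on the training data, return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# A degree-9 polynomial through 10 points fits the training data
# (near) perfectly -- that is the high-variance, overfitting end of
# the tradeoff; the linear fit is the high-bias end.
linear = fit_and_score(1)
degree9 = fit_and_score(9)
```

The training error of the flexible model is guaranteed to be at least as small as the linear one, which is exactly why training error alone can’t be used to compare models.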

One way to test a comparison of models for overfitting is to divide the data into “training data” and “test data” (in rating systems the test data usually comes later in time, due to ratings’ time-dependent nature): you tune all the parameters to the training data, then evaluate the models on the test data.
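As a concrete sketch of that split (the game-record shape and the 80/20 fraction are just illustrative assumptions):

```python
# Hypothetical schema: each game is (timestamp, winner_id, loser_id).
games = [
    (1, "alice", "bob"),
    (2, "bob", "carol"),
    (3, "alice", "carol"),
    (4, "carol", "bob"),
    (5, "alice", "bob"),
]

def time_split(games, train_fraction=0.8):
    """Split chronologically: tune parameters on the earlier games and
    evaluate predictions on the later ones. Ratings are time-dependent,
    so a random split would leak future information into training."""
    ordered = sorted(games, key=lambda g: g[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

train, test = time_split(games)
```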

Now, I’ve been told that, because the matchmaking system relies on the model, this approach can still miss some of the overfitting, but that’s a whole other topic.

Now, I might ask: what exactly are you looking to tune in terms of the volatility? The initial parameter? The tau constant? Its time-based relationship with the ratings deviation?

From what I can tell, the OP wants to test the subjective volatility, and is considering building a new system on top of the existing Glicko model (thereby creating a new model). In that case, to get a proper comparison, you’d need to tune all the Glicko parameters to the training data in both the unmodified and the modified forms, plus any parameters added on top (such as coefficients used when averaging previous ratings to estimate the current rating).


We could:

  1. Come up with some of those parameters, “reading”, “fuseki”, “direction of play”, “estimating the value of moves”, “estimating the score”, “making good shape”, “invasion”, “reduction”, and so on. Some parameters may be redundant/overlapping, that’s ok.

  2. With the help of a professional player, come up with a set of problems meant to evaluate all those parameters.

  3. Put all these problems in a form, along with the questions “What is your name, and which national go association are you a part of?” and “On which internet server do you play regularly, and what is your name on that server?”

  4. Use the problems to position each participant in the feature space, and use the records of the national associations and internet servers to find who wins against who.


Basically. In fact, funnily enough, Glicko-2 volatility is one parameter I don’t necessarily care about, since if I understand correctly it’s implemented in a bit of a weird way in the current system – it does seem marginally useful, but not too much. I feel that we could ignore it and the system would work almost entirely the same way.

 I don’t know how to explain this, but basically: people don’t like the perceived volatility, and it might be bad, and if that’s true we want to “fix” it.

 Then, I’m not sure exactly what the role of volatility is in the real Glicko-2 model, but AFAIU, in the current OGS rating system it’s exactly the opposite of what we want: the “volatility” parameter (or whatever parameter substitutes for Glicko volatility) measures how “spread out” the recent rating has been, and if it’s very spread out, the system makes the rating even more volatile by increasing the deviation.
 (I think based on the reasoning that a high “spread-out-ness” means either that the player is quickly getting better or worse at the game, or that the rating is otherwise very inaccurate and needs to be adjusted quickly.)
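For what it’s worth, here is a toy sketch of the kind of mechanism being hypothesized here. To be clear, this is a guess for illustration only, not the actual OGS code; the `factor` parameter and the use of a standard deviation are invented.

```python
import statistics

def inflate_deviation(deviation, recent_ratings, factor=1.0):
    """Hypothetical rule (NOT the actual OGS code): if a player's recent
    ratings are spread out, widen the deviation toward their standard
    deviation, which in turn makes future rating updates larger."""
    if len(recent_ratings) < 2:
        return deviation
    spread = statistics.stdev(recent_ratings)
    return max(deviation, factor * spread)
```

Under a rule like this, a stable recent history leaves the deviation alone, while a spread-out one inflates it, which is the feedback being described.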

On a related note, I have yet to finish reading the Glicko papers, but the impression I got so far is that the model isn't very useful for the use case of an online platform such as OGS, unless implemented in some specific way.

My understanding is that the Glicko models were built to fit the situation where:

  • players don’t play many games
  • they take the games they play quite seriously (focusing on winning and not on learning or teaching, for example)
  • the games are played in clumped-together groups but otherwise somewhat spread out in time (such as real life tournaments).
The second issue (the assumption that players focus on winning more than on having fun) is, I believe, unsolvable in practice (not just when adapting Glicko to OGS, but for any rating system), since this is an amateur platform where people come to have fun (which includes, but is very much not limited to, competing and testing their skills). So we’ll probably just have to accept that the rating is going to suffer from that – there might be weird ways to parametrize that too, but as we’re discussing, the simpler the better.

 By the way, in the recent threads people seem to keep telling me this is not true, but nobody has brought any convincing argument in support of that, or against mine: in a situation where players focus on doing their best to win and doing that the best way they know how, the “strength” of their performance should be much more stable than a situation where they’re just having fun or exploring new possibilities.
 Any exotic argument such as “you might inadvertently stumble into better moves when you’re playing in a relaxed way” is completely irrelevant, because most of the time that is not the case. If you keep your playing style stable, you’re expected to have a more stable performance. So changing your playing style when playing against weaker players, for example, is going to give a harder time to the rating system.

The clumped-together aspect might be adaptable to OGS by only updating the rating, say, once a day, using a rolling window system with an unfixed window size (equal to the number of games that day). If you want to satisfy the player’s curiosity to see an immediate improvement, you could have a separated “temporary” rating that updates game by game, but without affecting the “true” daily rating update.
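A sketch of that grouping step (the data shape is hypothetical, and the actual per-batch rating math is omitted):

```python
from collections import defaultdict

def group_by_day(games):
    """Group games into one rating period per day, with a window size
    equal to however many games were played that day. `games` is a list
    of (day_number, result) pairs -- a stand-in for real game records."""
    periods = defaultdict(list)
    for day, result in games:
        periods[day].append(result)
    # One batched rating update per day, in chronological order;
    # days with no games simply produce no update.
    return [periods[day] for day in sorted(periods)]

batches = group_by_day([(1, "W"), (1, "L"), (2, "W"), (4, "W"), (4, "L"), (4, "W")])
```

The “temporary” game-by-game rating mentioned above could then be computed separately from these batches, without feeding back into them.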

I think this assessment, while arguably correct (although in reality it doesn’t account for how much your rating has moved, since the rating doesn’t move during a ratings period), implies a positive feedback loop where one does not clearly exist (we do see volatilities settle at a reasonable level).

Only the last point is really important for the model (as that’s what makes ratings periods easy), and this is why I don’t complain too much about instant-time ratings periods on OGS, even if they are higher variance.

EDIT: I suppose I might as well justify the “taking games seriously” point. This actually goes back to the original Elo assumption that your performance during a game is already a distribution that includes taking it more or less seriously. One could argue, though, that different players have different levels of variance on this scale, and accounting for that might require another modification to the E function.

I’m curious what exactly this would garner in advantage over a simple daily updating period, as it runs the risk of having a similar problem as the “15 ratings” problem of the previous rolling window solution

RE: @ArsenLapin1’s idea

That sounds like a reasonable plan – but there might be parameters that humans don’t have concepts for yet.

Anyway, the whole point I was trying to make with that post is that, unless one really wants to venture that path, there’s no need to actually understand the latent space of “Go strength” in order to have a meaningful system measuring a player’s ability for matchmaking purposes.

So while I do have over-ambitious ideas to create a system to study the latent space or take it into account, I think for now I’d better focus on simpler things that we can realistically understand and control to possibly improve the current situation :laughing:

Well, it is a simple daily updating period, AFAIK “rolling window” means what I’ve seen called “fixed window” here on OGS, and it’s different from “sliding window”, though I might agree it’s a bit ambiguous as a term.

I guess I should have either been more clear or just avoided the details altogether :laughing:

Well, that’s what I meant: my hypothesis is that it does, in the current rating system, which is not Glicko-2. I suspect currently there’s a parameter that tries to make up for the fact that the rating window is 1-game by actually measuring the variance of some quantity of past rating changes, and uses that to update the deviation.

I think some amount of feedback loop does exist, and it might have something to do with players who play games more frequently apparently having a more volatile rating, but it all depends on how exactly the calculation to update the deviation is performed (it doesn’t seem to be a simple proportionality between past variance and updated deviation, for example).

And also of course I might be wrong.

What I can say from the anecdotal evidence I’ve seen so far is that when a player is on a winning or losing streak, the deviation seems to start increasing.

Now that I think about it, I guess this might counterintuitively help keep the volatility at bay because eventually the player is expected to get out of the streak, and because of the increased deviation, this pushes them back towards their “average” rating more strongly. (But intuitively, this sounds like a weird system)

Perhaps, rather than the deviation increasing with variance, it increases because of winning or losing streaks directly.


This actually has a number of reasons, but the key is that “streaks” often include multiple upsets, which the model takes as an indication that it knew less about where the true rating was than it previously thought (after all, the results are drifting further and further into the fringes of the distribution). This is a big part of what the Delta and v variables account for in ratings-period updates.
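For reference, the v (estimated variance) and Delta (estimated improvement) quantities are defined in Glickman’s Glicko-2 paper roughly as sketched below, on the Glicko-2 internal scale, with each opponent given as (mu, phi, score):

```python
import math

def g(phi):
    # Glicko-2 weighting: results against high-deviation opponents count less.
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)

def expected_score(mu, mu_j, phi_j):
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def v_and_delta(mu, opponents):
    """v is the estimated variance of the rating based only on game
    outcomes; Delta is the estimated rating improvement implied by
    comparing actual scores to expected scores (Glicko-2, steps 3-4)."""
    v_inv = 0.0
    delta_sum = 0.0
    for mu_j, phi_j, score in opponents:
        e = expected_score(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1 - e)
        delta_sum += g(phi_j) * (score - e)
    v = 1.0 / v_inv
    return v, v * delta_sum

# Three straight upset wins over a stronger opponent: Delta grows large
# relative to v, which is what pushes the volatility (and hence the
# deviation) up during streaks.
v, delta = v_and_delta(0.0, [(1.0, 0.5, 1.0)] * 3)
```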


I think one big issue here is that you’re asking for quite a lot of work just to convince you of… something.

Help me understand and check that this repository is correctly implementing glicko2 and also that this is the current code being used on the site.

I think you could get a reasonably easy yes or no on the second one to an extent from someone like @flovo or something when he’s around.

The first one is a big ask though I feel.

I know you’ve written a lot here, and I think you’ve already said this: it’s not clear why the system isn’t working as expected. Is the issue just that players’ ratings go up and down based on whether they’ve won or lost games?

Maybe you should read back through the ratings updates posts from anoek. They don’t paint a picture of something that was just randomly implemented without lots of testing. Just because all of that testing process wasn’t catalogued on display in a public forum, in an easily digestible manner, doesn’t mean it doesn’t exist or wasn’t done. I could be wrong also, but again I would start reading the previous updates 2021 back to 2014 to get a feeling for it if you want:

Actually though I do appreciate that you are working through the glicko papers.

It’s something I’ve wanted to do also. I’ve found that some of the pdfs are more descriptive and example giving, useful for people trying to implement the system, rather than postulating some assumptions and deriving it.

I think there is a journal paper, if I recall correctly, that goes into more of the assumptions, but I think I needed some more stats knowledge to understand it. I would like to know more and understand it.

Yes, we use the code in goratings/goratings/math/glicko2.py at master · online-go/goratings · GitHub almost verbatim; the only differences are around bindings to internal data structures and whatnot. If it’s a concern I’d be happy to send along a diff of the two versions. (I can confirm there are no algorithmic differences in the diff, just things like naming things “v5_Glick2Entry” instead of just “Glicko2Entry”, stuff like that.)

Note we don’t use the gor implementation in that repository at all, that was just something during testing to see how things stacked up against gor.

First off, I think it’s awesome you’re taking a stab at a better rating system. Glicko2 is better than my prior attempts, but as you’ve identified, there are some behaviors of it that are not particularly desirable. I’d welcome a better system if one is presented, noting though that the player experience is very important too (a technically better system that offers a poor experience probably isn’t going to be used, unless the benefits are just so astounding that it makes sense)

Some notes to consider when thinking of different systems, at least as it relates to any possible integration into the OGS system.

  1. There can be a difference between technically better and fair, as perceived by a player. Most players in practice prefer “fair” over marginally better but “unfair” ratings, specifically:
    • In general when this topic has come up before, players don’t like their rank drifting when they’re not playing
    • When you lose, your rank should go down, when you win, your rank should go up. Staying neutral is fine too for cases where the rank difference is high.
  2. Players want their ratings and ranks to be adjusted immediately, at worst, a small delay. (We have a small delay, but it’s usually sub-second unless things are backed up, this is plenty fast enough). This is a better experience than something like doing a big rating update at the end of either time or games played windows.
  3. We can’t count on users setting a good rating/rank when they join, the system needs to quickly find a suitable rating and rank for them.
  4. The target of having a rank difference be a good estimate for handicap is important
  5. It needs to scale to many millions of games
  6. Computing a rating update can look back a few games if that’s useful, but it can’t consider the whole history as that’s too heavy of a thing to compute for some players and bots.

That’s what comes to mind at the moment. Good luck, let me know what you find, and PR’s are welcome on goratings! :slight_smile:


A human’s strength changes with time. There are bad days and good days. I think it’s important for a ranked system to detect quickly when you’re having a bad day and give you a lower rank fast, so that you play weaker opponents on bad days and don’t go on a losing streak. Such a system will inevitably lead to what some people call “high volatility”. But is that a bad thing in such a case?

Thank you for taking the time to type up this assumption about OP’s goals and explanation of why upsets might be a normal indication of the model adjusting – I think these are some of the key points and assumptions I was failing to convey in my earlier discussion, and I’m sorry for the confusion.

@espoojaram The whole point of my thought experiment was to illustrate that ranking systems or models constructed to accurately predict game outcomes are potentially more prone to overfitting than other domains, since a model whose goal is to stratify players in such a manner will have to account for cascading effects on everyone’s rating whenever an upset, new interaction, etc. occurs. If you construct a rating system in such a way, you may need to add many new parameters which may improve the accuracy but lead to further overfitting (as @meili_yinhua mentioned). If you get this model to be extremely accurate at predicting outcomes in games it has seen, it will be very likely to learn all kinds of idiosyncrasies that may not hold in the future. When you want the model to continue having this high performance on future data, you are essentially “re-training” it and each new iteration will have to account for upsets, new interactions, new players, etc. which are very common, especially on a large and growing server like OGS. This is the variance part of the trade-off, and can manifest as high apparent volatility for the user. Again, I’m not saying that a model with lower volatility and pretty good outcome predictions isn’t possible, just that you have to be careful using accuracy of predictions as a metric for a “good” system if the end goal is actually to reduce volatility.

All that said, it may end up that you don’t add any new parameters, or that whatever model you test isn’t sufficiently complex or prone to overfitting that maxing out prediction accuracy ever becomes a problem in terms of volatility. It could even end up that the model captures some underlying, relatively static truth about ranks and game outcomes, but I think this is more unlikely simply due to the nature of human variability I was so intent on earlier.

I think I’m beginning to see why there was so much conflict centered around my focus on human variability and the elusiveness of a “true” rank. You are probably right that players with consistent, serious style will have more predictable performance that can be captured by some model. I just wanted to provide the counterpoint that humans can never be perfect, and inconsistencies related to the various factors I mentioned (luck, playing style variation whether intentional or not, good or bad days, etc.) are part of the irreducible error that any model will have to face, which may lead to overfitting or volatility. I may have focused too much on a single one of these factors or taken my examples to extremes in attempt to get my point across, and I apologize again.

To continue explaining in as much detail as possible, my concern here was with the desire to map from a proposed “schmating” in [0,1] back to a multi-dimensional feature space. This step is highly prone to bias, overfitting, and any number of problems that other models face. I know you acknowledged this step to be infeasible, and propose to instead estimate the schmating from win rates, but I just wanted to point out that such a system is also prone to similar problems, whether or not it is backed by a theoretical schmating, simply because it is an estimate constructed from real-world data.

Again, this is not at all discounting your ideas – I think they are great. I just wanted to provide some additional context and warnings from my own experience. My last comment on this idea for now:

@ArsenLapin1 brings up a very feasible and interesting way of exploring such a feature space. I alluded to this earlier, but to reiterate: generally someone who wants to construct a latent space (given a feature space generated as ArsenLapin describes) will take the human-understandable vector space with axes representing traits you know and try and generate a different latent space which maps the original points so that they lie along a manifold in some meaningful manner. In an extreme case, this generated manifold could actually just be a one-dimensional line, very similar to your schmating! This is part of what makes this concept so cool and powerful – rather than trying to manually map from the high-dimensional vector space to a lower-dimensional clustering or even rating, you can use real-world data to generate a (potentially better) mapping for you. Generally, however, the latent space is generated without any guidance as to what features make two points similar, which may not be as useful in this case. You may find that a supervised approach is more useful, where the goal of the mapping is to take the points in a high-dimensional feature space and map them to a scalar value (in this case, maybe something like “strength” as quantified by the percentage of the player base a given player/point is expected to win against). I bring this up not so much as a correction of terminology, but to add another theoretical tool to your arsenal.
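As a toy illustration of that supervised mapping (all numbers are invented for illustration; a linear map is the simplest possible choice, and real approaches would typically be nonlinear):

```python
import numpy as np

# Hypothetical data: each row is a player's scores on hand-crafted axes
# ("reading", "fuseki", "shape", ...), and y is the fraction of the
# player base they are expected to beat.
X = np.array([
    [0.9, 0.8, 0.7],
    [0.4, 0.5, 0.6],
    [0.2, 0.1, 0.3],
    [0.7, 0.9, 0.8],
])
y = np.array([0.85, 0.50, 0.15, 0.80])

# Supervised mapping from the feature space to a single "strength"
# scalar: here just linear least squares with an intercept term.
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def strength(features):
    return float(np.dot(w[:-1], features) + w[-1])
```

The fitted `strength` function is exactly the kind of feature-space-to-scalar collapse described above, learned from outcome data rather than specified by hand.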


Btw I noticed something with the system (credit to @espojaram for pointing it out to me) and I wanted to ask about this while you were paying attention to the thread

Assuming this is identical to how it shows up in this code, how often does step 6 actually get taken? The usual answer is “every ratings period”, with periods distributed evenly in time, but with instant-time ratings periods they are no longer distributed evenly over time.

How does the system account for the fact that players who have not played in a long time should have higher RDs?

EDIT: I suppose I have found this

But that still leaves the question of how the periods are supposedly counted
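For reference, the step being asked about is the pre-period deviation expansion (step 6 in the Glicko-2 paper). How many times to apply it per stretch of wall-clock time is exactly the counting question:

```python
import math

def expand_deviation(phi, sigma, elapsed_periods=1):
    """Glicko-2 step 6: before each rating period, the deviation grows
    by phi* = sqrt(phi^2 + sigma^2). Applying it once per elapsed
    period is one natural way to handle inactivity -- but how periods
    are counted with instant-time periods is the open question."""
    for _ in range(elapsed_periods):
        phi = math.sqrt(phi**2 + sigma**2)
    return phi
```

Under this reading, a player who skips many periods accumulates many expansions, which is what would make long-idle players return with a high RD.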


We don’t batch matches (I did experiment with that, which is why you’ll see some helper code and whatnot, but it’s not used), so step 6 is run after each game. In practice, the matches list is just one entry: the game that was just played.

I also experimented with deviation expansion, both time based and number of games played kind of thing, if I recall correctly, but in the end I ended up not using it, so that method you posted is not used in production.


I don’t know if I should try to respond to every point brought up by everyone, but first things first:

I want to thank @anoek for taking the time to write such a thoughtful reply and for the kind words of encouragement.

And of course I also want to generally thank everyone for bringing their knowledge and opinions to the table.

I agree, and in fact, on the practical side of things, better user experience is my main goal in this endeavor.

Since many people seem to be confused about this, here's a clarification. (TL;DR: I'm under the impression that rating volatility makes for a worse user experience independently of the matchmaking, because of the cultural expectations surrounding ranks and other psychological effects.)

I am currently under the impression that for a lot of people, a volatile rank makes for a worse user experience:

When this whole conversation started, I was actually in the same camp as @stone.defender (and maybe @joooom?): I was the one saying “but maybe the high volatility is useful to allow for better matchmaking, and the rating is just catching on to actual fluctuations in a player’s skill”, and other people were skeptical.

But then it just made sense to be honest and recognize that “if that’s the argument we keep giving to justify the presence of the volatility, can we at least verify that it is indeed helping the matchmaking?” and here we are.

(Of course there’s the other, much better, argument for the presence of the volatility, that @joooom has brought up a few times, namely that reducing volatility would necessarily sacrifice something else because of the mathematics involved, apparently – this should hopefully be addressed, in terms of user experience, by the rating-rank separation I proposed in the above quote)

Since user experience is my priority, this is actually the question I’d like to find answers to most urgently: what do the people on OGS and generally visiting Go servers want?

Anoek touched on this, for example

It sounds like these claims are backed up by at the very least anecdotal experience. Have surveys been conducted in the past? Could anyone point me to a source?

(To be clear, I’m not surprised at all about other claims such as “players don’t like their rank drifting when they’re not playing”, “when you lose, your rank should go down, when you win, your rank should go up”, and I’ve seen them being addressed in past forum threads; it’s especially the above ones that I’d like to verify if possible).

Also, while it may be surprising, it’s not obvious for me to predict, based on the boundaries that have been specified, which of my ideas would make for “good user experience”.

Question related to the above, and to two incompatible viewpoints I'm aware of about rating volatility.

 If we did have a system with a “(possibly) cosmetic, stable rank” separated from (but also possibly calculated from) a “volatile rating”, would it be “good user experience” to put an option in the user settings, allowing to choose which one of the two systems to use for matchmaking (specifically the automatch system)?

 I ask this because I know of at least two people with apparently diametrically opposite ideas about rating volatility: @stone.defender apparently believes it’s an accurate representation of a player’s skill actually fluctuating, and another player, who may or may not want me to quote them, said that they stopped playing on OGS because of their perception that rating volatility is not an accurate reflection of players’ skills, especially in higher ranks.

So, for example, I believe these users would choose different options if they were allowed to, and they would feel happy about their choice…

…but then again, I just realized this would mean splitting the automatch pool in two.

So no option for the automatch, I think. BUT the users could be allowed to restrict rank (in the custom challenge settings) based on whichever system they believed was more accurate, would this be good user experience?

Edit: Well, I wrote a lot of this because I was curious about what anoek might bring to the table, but he has decided to mostly not comment on this :laughing:

Not a problem, but it does mean this one post is very off-topic relatively to the specific objective I meant for this thread (discussing the methodology for testing the volatility), so if anyone has opinions about what I said in this one reply, do express them, but please open a new thread to do so :slight_smile:

I’m going to mostly stay out of the conversation about theories and whatnot, but I’m happy to answer specific questions about the existing system.

On the topic of volatility:

  1. I would love to have a system that is less volatile
  2. That system needs to still quickly help people find their ranks initially and after hiatuses.

Finding a balance between those isn’t the easiest thing.

The other thing to note is that correspondence games have some edge cases that kind of throw a wrench in things: people who play a lot of concurrent games sometimes go through a purge of games they’re losing (by resigning a bunch), or sometimes wander away, time out of all of their games, then come back and start playing again a year later. Furthermore, a player’s strength can change a lot over the course of a correspondence game; some games go on for years.

The point being, a production system needs to be somewhat resilient to those kinds of situations too, which basically boils down to another case to be concerned with in point 2.


Oh! The correspondence thing is actually something many people have wondered about, it would be great to have an answer to these two questions:

  1. Does the current system calculate the rating from the player’s rating at the time the game started, or at the time the game ended?

  2. Has it ever been tried to “interpolate” the initial and final rating of the player to try to compensate for that problem?
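To illustrate question 2, here is one naive form such an “interpolation” could take (purely hypothetical, not something OGS does; `weight=1.0` reduces to “use the rating at game end”):

```python
def interpolated_opponent_rating(rating_at_start, rating_at_end, weight=0.5):
    """Hypothetical compensation for long correspondence games: credit
    the opponent with a rating partway between what they had when the
    game began and what they have when it ends."""
    return (1 - weight) * rating_at_start + weight * rating_at_end
```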


As far as I know, no, but I did once theorize a method to do so; it attracted little notice.



I don’t believe I have tried doing any interpolating during my various tests, I certainly tried start and end, and end was better.


To clarify this one more time in terms of the proposed improvements, without delving into too much implementation or mathematical specifics, here is what I am saying:

If one goal is to improve matchmaking, how are you measuring this improvement?
If this metric is in fact how well rankings can be used to predict game outcomes, then there is a potential issue:
> A model or ranking system which performs extremely well in terms of predicting outcomes within a given player pool and set of games (“good” matchmaking) will generally be overfit to that snapshot
> If you want this system to adjust to account for new games (as happens daily on OGS), increased apparent volatility may result

So in the quest to improve matchmaking and reduce volatility, just be careful to realize these concepts can conflict. That doesn’t mean there isn’t a good balance, however, or that other means of quantifying a “good” system don’t exist. I also like @espoojaram’s idea of splitting a more stable rank from a matchmaking ranking. This is just one of many balancing acts that needs to be performed for a ranking system, and @anoek mentioned many others.

Of course this balancing act will inevitably come to light as implementations and actual data come into play, but I wanted to ensure that this concept was brought up now for one very important reason: earlier parts of the discussion were suggesting that new systems should be judged primarily by how well they make predictions about game results.