Testing the Volatility: Summary

I am referring to this excerpt:

Fully agreed. That idea is why I made the following suggestion:

since it might spread that “diffusion” process out even more and result in a lower perceived volatility for any single player. Whether or not that is desirable is up for debate.

Sorry for any confusion – I just meant applying further filtering for the purposes of displaying to the user, similar to what certain modes of the existing OGS graph already do.

Ah, this? I believe w is a parameter of the model, assumed to be constant across each and every player (similar to the c constant in Glicko-1, which the WHR test even translates into w²).

Maybe, but I would begin to ask how this would be modeled without a solution such as what WHR implements (which I believe is also in some respects similar to what Glickman’s non-approximated model uses), or how to achieve reasonable approximations of covariances between players’ ratings.

Definitely agreed – that’s part of why I like other metrics mentioned earlier, where models are instead judged by the “surprise” of encountering a certain game result given the model’s predicted win percentage rather than a binary correct/incorrect metric or similar. It might even be fine that many games have close to a 50% expected win rate. If a model thinks two players should have a 50/50 chance of winning against each other, that’s great as long as those players have very similar ranks. It should only be “punished” if it predicts a 50% win rate where there isn’t one, indicating some error in the ranks it has assigned.
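
For concreteness, one way to write down such a “surprise” metric (mean binomial deviance / negative log-likelihood, stated here just as a reference formula, not something already fixed in this thread) is:

$$
\mathcal{D} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\bigr]
$$

where $p_i$ is the model’s predicted win probability for game $i$ and $y_i \in \{0,1\}$ is the observed result; lower is better, and a confident wrong prediction costs far more than an uncertain one.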

It could involve maintaining a graph of recent interactions in the player base, and using a characteristic like the graph distance between two players to weight a parameter representing uncertainty about the difference between the baselines/average rankings of the respective “pools” those two players belong to. I fully agree that this would need to be examined with a lot more rigor, and might not produce any desirable results.
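
As a very rough illustration of what I mean (entirely hypothetical – the player names, the extra-uncertainty formula, and the scaling constant are all made up for the example), one could compute the graph distance over recent games with a plain BFS and use it to inflate the uncertainty attached to a pairing:

```python
from collections import defaultdict, deque

def graph_distance(games, a, b):
    """BFS distance between players a and b over a list of (p1, p2) recent games."""
    adj = defaultdict(set)
    for p1, p2 in games:
        adj[p1].add(p2)
        adj[p2].add(p1)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # never interacted, even indirectly

def pool_uncertainty(games, a, b, scale=25.0):
    """Hypothetical extra rating uncertainty: grows with graph distance, capped."""
    d = graph_distance(games, a, b)
    return scale * min(d, 6)  # cap so totally disjoint pools don't blow up

recent = [("ann", "bob"), ("bob", "cho"), ("cho", "dee"), ("eve", "fay")]
print(pool_uncertainty(recent, "ann", "dee"))  # distance 3 -> 75.0
print(pool_uncertainty(recent, "ann", "eve"))  # disconnected -> capped at 150.0
```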

Yeah – this parameter seems to be a direct way of “tuning” the degree to which jumps in a player’s estimated rank over time are allowed, potentially controlling apparent volatility but also potentially reducing the predictive accuracy of the model if set too low. I do see that this “allowable rating difference” will also increase in proportion to time (as they mention and you mentioned earlier, “the variance increases linearly with time, so the confidence interval grows like the square root of time”), but in the short term using such a process as a prior in the MAP estimate should effectively limit the degree to which the estimated rating jumps when updated with games close in time. We are perhaps understanding the same thing, though I just wanted to relate it to this idea:

In this case, WHR seems to specifically control for this, but my overall goal was to point out in another way that models which try to globally optimize ranks to explain a given set of game results may do so at the expense of consistent individual rankings if not controlled.
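
As an aside, here is a minimal sketch of the “variance grows linearly with time” behaviour (a generic random-walk simulation with a made-up step size w, not WHR’s actual implementation):

```python
import random
import statistics

def simulate_drift(steps, w=10.0, n_paths=5000):
    """Spread of a random-walk 'true rating' after `steps` updates of std dev w."""
    finals = []
    for _ in range(n_paths):
        r = 0.0
        for _ in range(steps):
            r += random.gauss(0.0, w)
        finals.append(r)
    return statistics.pstdev(finals)

for t in (1, 4, 16):
    # empirical spread vs the theoretical w * sqrt(t)
    print(t, round(simulate_drift(t), 1), round(10.0 * t ** 0.5, 1))
```

The empirical spread tracks w·sqrt(t), which is exactly why a long gap between games widens the prior so much faster than a short one.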

2 Likes

Mkay, let’s imagine we find a modelling approach that makes some sense out of this, so that players who play frequently with each other don’t move in ratings as quickly as they do against newer players, setting aside computation details…

How would this system account for, say, two pools of players A and B, of which pool B has a “true average” about 100 Elo-scale “points” above pool A? Suppose a small number of players within each pool (whom I’ll call “diplomats”) play the diplomats of the other pool, and their ratings move very quickly to account for this, but when these diplomats go back to their “home” pool, their ratings once again move slowly, resulting in an even slower shift of the pools towards their proper ratings.
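
Just to make the scenario concrete, here is a toy sketch with plain Elo updates (the pool sizes, K-factor, cross-pool game frequency, and the 100-point offset are all invented for the example, and nothing here reflects how OGS actually updates ratings):

```python
import random

K = 16  # fixed Elo K-factor for the toy example

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def play(ratings, true, a, b):
    """One game between a and b, decided by their *true* strengths, rated by Elo."""
    p = expected(true[a], true[b])
    score = 1.0 if random.random() < p else 0.0
    e = expected(ratings[a], ratings[b])
    ratings[a] += K * (score - e)
    ratings[b] -= K * (score - e)

# Pool A true strength 1500, pool B 1600; everyone starts rated 1500.
true = {f"A{i}": 1500 for i in range(20)}
true.update({f"B{i}": 1600 for i in range(20)})
ratings = {p: 1500.0 for p in true}
diplomats = ["A0", "B0"]  # the only players who cross pools

for _ in range(20000):
    if random.random() < 0.05:           # rare cross-pool game
        play(ratings, true, *diplomats)
    else:                                 # usual in-pool game
        pool = random.choice("AB")
        a, b = random.sample([p for p in true if p.startswith(pool)], 2)
        play(ratings, true, a, b)

def pool_avg(prefix):
    return sum(r for p, r in ratings.items() if p.startswith(prefix)) / 20

print(round(pool_avg("A")), round(pool_avg("B")))  # the 100-point gap emerges only slowly
```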

1 Like

Agreed that this scenario is the most problematic. We might be proposing two different uses for the new uncertainty parameter.

I was proposing that players whose “pool uncertainty” w.r.t. each other is high should actually update their rankings less quickly after they play a game than if they were playing within their own pool, since it’s potentially unclear what their actual relative rankings are.

I realize this would also result in slower resolution to the “proper” ranking:

I definitely realize this would lead to pools where the same rank represents different “true” abilities, with ranks only stabilizing over the very long term, once these pools start interacting more and more often and the number of “diplomats” increases until the pools merge. Since the original topic of the discussion was reducing volatility, though, I wanted to point out this major source of potentially unavoidable volatility and show a (not necessarily feasible) way of addressing it (other solutions being a more global approach like WHR, which comes with its own issues).

1 Like

Maybe? Personally, I think that if a major source of volatility is that you have pools (say, based on time zone or some such) that are not all “properly placed” relative to each other, the ideal solution would be to make those pools converge towards each other faster. You will inevitably have many “diplomats” who play across many pools, and the more improperly placed their relative ratings are, the more data the system will find to be strange or conflicting in some way, especially if they frequently play against both pools.

1 Like

Definitely! Again, I’m trying to illustrate multiple factors at play here, and how these relate to the overall goal of “reducing volatility”. In other words, these pool interactions may very well be the source of a decent amount of volatility currently found on OGS, and the quest to reduce volatility might have to accept that as a given. Trying to reduce it could lead to isolated pools like my hypothetical system (which could reduce individual volatility but at a high cost).

Then again, a more global approach might balance these pools very quickly as you mentioned and reduce volatility in the long run, but it could also be the case that such adjustments have to occur regularly given how the “landscape” of OGS players changes over time. I have no way of knowing at the moment.

(Something I haven’t noted yet in this thread that might provide more context – I actually personally like the current rating system, and think it does a pretty good job, volatility and all.)

1 Like

 I think now is a good time to reiterate that my intention with this thread has always been, especially in the short term, not to invent a new rating system, but rather to test if attempts to reduce volatility, through simple or even naive adjustments to the current one, might provide an improvement :laughing:
 (or at least a similar performance, in terms of predictions and matchmaking – and the purpose of this would be to help the rank match the existing cultural expectations of stability)

 Of course if someone comes up with a new rating system that matches the guidelines anoek has posed, nothing stops us from testing it, since we’re going to have to build a testing framework anyway.
 But that’s not an urgent priority imho, especially since I agree that the current system is fine.



 So to start putting the focus back on that idea, I’d be interested in an elaboration of this:

 Is there any metric that actually punishes such situations?
Even binomial deviance/log-likelihood seems like it would only “punish” wrong 50% predictions if you have a lot of them in the data, but any single one of them wouldn’t cause any more “surprisal” than a correct 50% prediction, right?
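
A quick sanity check of that intuition (just the arithmetic, nothing model-specific): under binomial deviance a 50% prediction contributes exactly the same loss whether it turns out “right” or “wrong”, so a single such game can’t be punished; only a systematic pattern of mispredicted 50/50 calls shows up, and even then only relative to a model that predicted those same games better.

```python
import math

def deviance(p, outcome):
    """Negative log-likelihood of one game: p = predicted win prob, outcome in {0, 1}."""
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

print(deviance(0.5, 1))   # 0.693... "correct" 50% call
print(deviance(0.5, 0))   # 0.693... "wrong" 50% call - identical surprisal
print(deviance(0.9, 1))   # 0.105... confident and right
print(deviance(0.9, 0))   # 2.303... confident and wrong
```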



Your recent comments (about player pools and such) led me to these two thoughts:

  • assuming a framework inspired by the Elo system, where the rating system uses wins and losses to adjust rating estimates, I feel like the only way to solve the “player pools” issue (and many other issues, including ones that affect the current system) would be to fiddle with the matchmaking, perhaps more so than the rating algorithm itself (essentially forcing the pools to merge as much as possible, and more generally matching players so that each match maximizes the information gained – see the sketch after this list);
     and I feel that we don’t really have that luxury on OGS – though that depends on how many people use the automatch system. (Well, we can propose changes to it, but realistically it won’t change too much)

  • otherwise, the only other idea that comes to mind is another one for a ranking system that I’ve had for a long time, along with other people: creating ranking bots that can look at a game and guess the player’s level (now this might be related to fiddling with latent spaces or feature spaces :laughing:)
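
On the first bullet, a crude sketch of what “matching to maximize information” could mean (purely illustrative – the candidate list, rating scale, and scoring rule are invented here): prefer the pairing whose predicted outcome is closest to a coin flip, optionally nudged towards opponents from under-connected pools.

```python
def win_prob(ra, rb):
    """Elo-style predicted win probability of a vs b."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def match_score(ra, rb, cross_pool=False, pool_bonus=0.05):
    """Higher is better: outcome uncertainty, plus a small bonus for linking pools."""
    p = win_prob(ra, rb)
    uncertainty = p * (1.0 - p)          # maximal (0.25) at a 50/50 prediction
    return uncertainty + (pool_bonus if cross_pool else 0.0)

# candidates: (name, rating, is_in_another_pool)
me = 1520
candidates = [("u1", 1500, False), ("u2", 1700, False), ("u3", 1560, True)]
best = max(candidates, key=lambda c: match_score(me, c[1], cross_pool=c[2]))
print(best[0])  # picks the opponent offering the most informative game
```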


So I thought about the Brownian motion thing, and I think there’s an interesting quirk in the idea.

(My understanding of the relationship between Brownian motion and Gaussian distributions was of course incorrect, but I don’t know if it’s useful to explain how)

First, a feature that I think does make Brownian motion at least partially unsuitable as a model for the “true rating”: unless most of the player population improves while a given player doesn’t, you shouldn’t expect that player’s “true rating” to decrease much over time, apart from momentary slumps.
Whereas it’s obviously quite common for a player, especially in the lower ranks, to improve, and thus for the true rating to increase over time.

This intuitively leads me to think that a good model for a player’s true rating should be somewhat biased towards increasing, or at least against decreasing, whereas Brownian motion isn’t biased, not even towards staying stable.
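
Formally (just the standard properties of a Wiener process, to back up the “isn’t biased” part): for a Brownian-motion rating $R_t = R_0 + \sigma W_t$,

$$
\mathbb{E}[R_{t+s} \mid R_t] = R_t, \qquad \mathbb{P}(R_{t+s} > R_t) = \tfrac12, \qquad \operatorname{Var}(R_{t+s} \mid R_t) = \sigma^2 s,
$$

so the model expects neither improvement nor decline, only an ever-widening spread; adding a drift term $\mu t$ would be the simplest (hypothetical) way to encode a bias towards improvement.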

Still, I reached the conclusion that Brownian motion is in some sense a good model, but only because it’s “pure randomness”: essentially, by modeling something as Brownian motion, you’re saying “we need to be ready for somewhat sudden jumps and whatever happens”, something like that.

But I think this kind of sidesteps the problem of actually finding a good model. It doesn’t mean we should actually think that the rating doesn’t have a center of undulation; it just means it’s intrinsically noisy and that the algorithm needs to be able to react when the center of undulation does in fact change.


 I still stand by the starting assumption of my intuitive reasoning:

 And especially since we can’t know if any high-frequency noise we detect is part of the true rating or just caused by the various sources of error, I think trying to chase the noise is a bad idea, and I think instant ratings have to deal with a very strong trade-off between chasing the noise and not being reactive.

 The analogy with waveforms was just in terms of improving the signal-to-noise ratio, since my mental model for the measuring of the true rating is that of a noisy function with a stable center of undulation, which is the variable I’m interested in.

So of course I know basically nothing about information theory :laughing: but intuitively, looking at more info at any one time just seems more likely to yield more accurate measurements; it’s what they do in experimental physics, so why should we expect the idea not to apply at all here?

I never checked how heavy the separation is.
The most likely reasons for clusters in the player base are, I think:

  • Preferred board size
  • Preferred speed settings (live vs correspondence)
  • Different activity hours (due to lifestyle or timezone)
  • Preferred matchmaking method (automatic, custom games, tournaments, ladders)
  • Groups (school classes, …)

Besides removing some options completely, the only relevant change would be to allow automatch to select open challenges.

On the other hand I’m not sure clusters are a problem either.
If the clusters are separated, then matchmaking happens only™ within a cluster, and inside each cluster the rating behaves as if there were no other clusters.
If the clusters are heavily linked, the links balance the small rating movements of one cluster against the other.

3 Likes

Well, if I understand correctly, the worry is about the situation in the middle: the clusters are not linked heavily enough to “normalize” relative to each other, but not separate enough to function independently. The main concern was that this could be an inescapable source of volatility (since most interactions between very different clusters would cause the rating system to be very “surprised” and have to adjust the rating towards the new cluster).

1 Like

We probably don’t have the data needed.
We don’t have the ranks of the players. If we had them, we wouldn’t be discussing a better method to estimate them.
To estimate the true™ win rate between two players, we need a lot of games between these two within a time frame where we expect rating changes to be marginal.
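
To put a number on “a lot of games” (plain binomial arithmetic, nothing OGS-specific): the 95% confidence interval for an observed win rate of about 50% has a half-width of roughly 1.96 * sqrt(0.25 / n), so even a ±5% estimate already needs hundreds of games between the same two players.

```python
import math

def ci_half_width(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% CI for a win rate p after n games."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (10, 50, 100, 400):
    print(n, round(ci_half_width(n), 3))
# 10 -> ~0.31, 50 -> ~0.139, 100 -> ~0.098, 400 -> ~0.049
```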

One can test this by identifying the clusters and then comparing the quality of the ratings within the clusters to the quality across the links.
I wouldn’t spend too much time on solving a problem I might not even have.

Could you make a list of the top ideas you’d like to test out?

I mentioned adjusting parameters in glicko earlier (probably the easiest modification of the existing system to implement) but I believe you said that wasn’t what you were interested in.

1 Like

The reason I hesitate is just because of terminology and the specific problem domain. When talking about a signal to noise ratio, the signal in question is usually very well defined, with nice properties we either know or can try to estimate (frequency, phase, etc). The noise itself is usually pretty well quantified too (e.g. white noise). It’s also generally clearer how this signal is combining with the noise (additive, multiplicative, etc).

In this case, it’s not immediately clear to what extent the “true” signal exists or how it might be described, or how the complex interactions resulting in the more “noisy” data we’re seeing might be quantified. Statistical concepts (like those that have been discussed – variance, random variables, confidence intervals, etc) are probably more appropriate. Not to say that there isn’t overlap (there’s a ton between statistics and signal processing), just that the analogy to signals might create some unwanted expectations.

All that said, I realize this is nitpicky and fully agree that the concept of a more stable rank may be obscured by “noisy” deviations.

The types of metrics I’m referring to are those similar to the WHR paper based on a raw proportion of correctly predicted games. On a dataset where everyone is somehow of the same “true” rank and wins nearly exactly 50% of the time against everyone else, a model that puts everyone at the same rank and predicts a fifty percent win rate all the time will of course get a score of about 0.5 under such a metric, as you said. A model that adjusts ranks frequently and somehow captures more than 50% of the game outcomes would get a higher score, even though the former model is actually what we want. I’m just saying that the other metrics without this property are better, as I believe you’ve been saying all along.

(Edit: ahhh… another potential issue here – I’m talking about the potential to accidentally and unfairly punish correct 50% predictions. I realize my original statement was open to confusing interpretation, maybe the revised one below helps)

This is probably obvious and has been brought up again and again, but the reason I wrote my thoughts out again was because of these statements:

I wanted to bring up the extreme case described above as a way of showing that the key to judging a good model (if the goal is to judge one by predictive accuracy) might be how well it stratifies players whose observed win rates are significantly different than 50%. I fully agree that a dataset where most matches are between players of similar skill could be problematic for this very reason, and I agree with @flovo that there may not always be enough data for a given player.

We probably agree and are saying the same things in general, just have different implicit assumptions or interpretations of each other’s posts (forums can be hard!).

Yes, this is something I’ve tried to be very clear about all along. Maybe this is clearer if I say

This may be a very obvious statement, but worth stating in the context of designing a metric for evaluating a model. I also realize, as you say, that

It’s perfectly possible that a system might happen to observe two players with a ~50% win rate, thinking they have the same “true” skill, when in reality there just isn’t enough data.
This is yet another reason that judging a ranking system by its predictive performance is potentially problematic, and that the evaluation metric needs to be very carefully considered.

Throughout this whole topic, I have been working under the following assumptions:

  1. The goal is to reduce perceived volatility of rankings
  2. Systems will be (at least in part) judged by their predictive ability, by some metric

I am trying to point out various issues I can foresee with this approach, including how we define such a metric. Given various types of metrics, I am pointing out potential issues. In that light:

Will this necessarily happen? Not at all, but it is worth bringing up, not necessarily in an effort to solve it, but to further discuss the end goals and compromises/considerations that may come up along the way.

I apologize if any of this sounds terse, I just understand that forums can be a difficult way of communicating and want to make things as clear as possible. Often we all agree but conversation devolves because it is hard to keep track of everyone’s past statements.

2 Likes

Is there a way to access the list of game results of all ranked games played on OGS?

Not the SGFs, just the time the game ended, board size, time category, handicap, player IDs and result.

1 Like

The history of rated games is part of the goratings repository.

3 Likes

Ok, I do need to clarify that.
 This assertion was based on my understanding that the Glicko volatility parameter (call it GV) is either unused or useless in the current OGS system – since my current understanding of GV is that it’s calculated based on observing the “incoherence” of the set of game results in the entire ratings period, which means that with an instant ratings period, like OGS has now, it can’t work as intended (a single game result is always “coherent” with itself).

There may or may not be another parameter compensating for this, but then it wouldn’t be GV, AFAIU.

This should make the following less surprising:

For now, the ideas I’m mostly interested in are:

  • Glicko-2 with a “fixed” window system, also testing out different window sizes, but mostly based on time, since AFAIU that’s how the model was intended to be used (for example, daily updates).
  • Some type of “naive” smoothing or stabilizing algorithm applied to the current rating system – perhaps with one parameter that regulates the “amount of smoothing”, so that we can test what degree of smoothing seems to perform best in terms of win probability predictions (might be 0% smoothing, of course :laughing:) – see the sketch after this list.

  • The current system, but with some fiddling to reduce the deviation after the starting provisional period (because I expect this to “chase the noise” less).
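
For the smoothing idea, here is a minimal sketch of what I have in mind (the single `alpha` parameter and the idea of smoothing only the displayed rating are assumptions for illustration, not an actual OGS mechanism): an exponential moving average layered on top of whatever instantaneous rating the current system produces.

```python
def smooth_ratings(instant_ratings, alpha=0.3):
    """Exponentially smooth a sequence of instantaneous ratings.

    alpha = 1.0 reproduces the raw ratings (0% smoothing);
    smaller alpha -> less volatile, but slower to react.
    """
    smoothed = []
    current = None
    for r in instant_ratings:
        current = r if current is None else alpha * r + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

raw = [1500, 1540, 1490, 1555, 1510, 1565]
print([round(x) for x in smooth_ratings(raw, alpha=0.3)])
# the smoothed series drifts towards the same level with smaller swings
```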

As far as evaluation metrics go, so far the ones that interest me the most are the ones described in the first post in this thread, but I encourage further suggestions.

How would you evaluate the “quality” of a rating system, or even better, how would you quantify the “usefulness” or “necessity” of the current level of perceived volatility? This was the question that started it all.

EDIT: Oh, I just remembered there are also these questions:

I’d like to find out if there’s something that we can do to limit the negative effects of correspondence games specifically (caused by the “delay” between the moment the moves are chosen on the board, and the moment the rating update happens).


While of course I don’t want to dictate what others might be interested in, I want to make it clearer that I’m not sure this perfectly describes my own goal.

Or rather, while I appreciate that @joooom is bringing to the table valid theoretical concerns about why volatility is inextinguishable (at least without causing other issues), and might already be minimized in the current system, what I’m interested in for now is just performing the experiment – what does the theory say, OK, but what does the data show?

 So, yes, I’d like systems to be (at least in part) judged by their predictive ability, by some metric, with the goal of testing whether there’s a way to reduce the perceived volatility without significantly damaging the performance. (Well, that’s my personal aim :slight_smile:)

(and again, suggestions on different metrics of judgement are welcome, those are just the better ones proposed so far)

After that, recapping what I already said:

  1. if such a way is found, this might allow making more users satisfied with the system by way of, inhale, meeting the cultural expectations surrounding ratings and especially Go ranks.

  2. if such a way isn’t found, or it’s found but doesn’t really stabilize ratings too much, I will keep proposing to separate the rank from the rating, keep using the rating for practical purposes where possible and artificially stabilize the rank.

2 Likes

I hope the past discussion has at least been somewhat helpful in illustrating potential pitfalls and things to consider! I’m sorry if I derailed it a bit. My last note on that for now then: it might appear that a less volatile system performs worse in a predictive sense, but, depending on the metric chosen, that could be misleading for any number of the reasons we’ve discussed above.

For that reason, it might be useful to quantify predictive performance by a number of metrics on both the old system and on proposed changes, so as to get a fuller picture.

2 Likes

This is not true, as the volatility parameter updates do not care about inconsistencies within a ratings period. Volatility is a measure of a sort of “speed” of the aforementioned Brownian motion process of “true” ratings. It is modified by a few parameters: the prior RD, the expected variance of the games played (without regard to the game results), and a sort of “estimated rating change” that uses the variance as a parameter. While I’d need to do a bit of analysis to see how these parameters affect the volatility update, it is clear that the estimated rating change works counter to the effects of RD and variance.

Keep in mind that ratings periods are treated as if all games within them are played essentially simultaneously, so it’s assumed that any inconsistent data within them is the result of the “performance distribution” centered around a single true rating.
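
For reference, here are the quantities being described above, as defined in Glickman’s Glicko-2 paper (just the per-period variance v and the estimated improvement Δ; the iterative volatility update that consumes them is omitted):

```python
import math

def g(phi):
    """Glicko-2 weighting of an opponent by their rating deviation phi."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def E(mu, mu_j, phi_j):
    """Expected score against opponent j (on the Glicko-2 internal scale)."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def variance_and_delta(mu, opponents):
    """opponents: list of (mu_j, phi_j, score) for one rating period."""
    v_inv, delta_sum = 0.0, 0.0
    for mu_j, phi_j, s in opponents:
        e = E(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1.0 - e)
        delta_sum += g(phi_j) * (s - e)
    v = 1.0 / v_inv            # estimated variance of the period's results
    delta = v * delta_sum      # estimated rating improvement over the period
    return v, delta

# A single-game "instant" period: v and delta still exist, but delta measures
# surprise relative to the prior rating, not inconsistency within the period.
print(variance_and_delta(0.0, [(0.5, 1.2, 1.0)]))
```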

2 Likes

In my experience, Glicko volatility is always near the initial 0.06. It only deviates for players with a very unbalanced history (most opponents stronger/weaker). In these cases Glicko volatility increases by an unhealthy amount, leading to a highly volatile rating. This is to some extent counteracted by not allowing the GV to be bigger than 0.15.

2 Likes