Testing the Volatility: Summary

Alright, I guess we can go back to annoying the rest of the forum community, or being made fun of by them, for writing long technical posts.

It’s not necessarily relevant for the “shorter term” goal of just testing the volatility itself, but it’s useful to discuss this for the potential longer term goal of proposing an actual new rating system.

What do you think about the idea of tuning, if necessary, the system on a subset of the past data and then performing a “test drive” on a different, or expanded, subset of data?

In fact, if needed, such a system could even be expanded to multiple steps, every time expanding the dataset further and seeing how solid the system stays.

Since it’s a bit of an obvious idea, I imagine you have your gripes with it?

But I don’t understand: if you only ever receive feedback once from a specific dataset, doesn’t this very significantly reduce the risk of the tested system being overfit to the data?

(Of course it would be impossible to prove it mathematically, but statistically speaking, there’s gotta be a point where you go “alright, this system is solid enough”)
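
To make the idea concrete, here’s a minimal sketch of what I mean (`fit_fn` and `eval_fn` are placeholders for whatever rating system and accuracy metric we end up choosing):

```python
def expanding_window_eval(games, fit_fn, eval_fn, initial_frac=0.5, n_steps=3):
    """Tune on an initial slice of game history, then repeatedly evaluate
    on the next unseen slice before folding it into the training set."""
    n = len(games)
    cut = int(n * initial_frac)
    step = max(1, (n - cut) // n_steps)
    scores = []
    for i in range(n_steps):
        train = games[: cut + i * step]                       # everything seen so far
        test = games[cut + i * step : cut + (i + 1) * step]   # unseen slice
        model = fit_fn(train)                                 # tuning happens here only
        scores.append(eval_fn(model, test))                   # one-shot feedback per slice
    return scores
```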

1 Like

Yup, that’s a standard way of addressing the problem and is generally a good idea. Now is where we get back into more abstract and contentious ideas including the existence/nonexistence of a stable, “true” rank :sweat_smile:

My gripe is that I strongly suspect, even if certain systems perform well on unseen data, the highest performing (in terms of predicting game outcomes) of these systems will have unwanted rank volatility. This is where there’s a bit of a muddy area between the bias-variance tradeoff in a statistical sense and user-perceived volatility: ranking systems are weird in that we’re treating them as predictive models (should player A of rank x beat player B of rank y?) as well as a constantly-evolving system that assigns a number to each player for other reasons (showing improvement, giving users an idea of their skill, etc.).

A ranking system could have low variance in a statistical sense (in that it is not sensitive to the dataset and outputs good outcome predictions on new, unfamiliar data or additional games that cause ranks to evolve), but my argument is that, in order to continue making these very accurate predictions based solely on assigning a scalar rank to each player, such a system will actually have to drastically alter rankings internally every time new data becomes available. New users will be added, upsets will occur, and people who have never played each other before might start interacting. Cheating is also a confounding factor…

In other words, I do not believe it is possible for a model to make very accurate game outcome predictions over time, while also maintaining a stable rank for players that only changes significantly when their “true skill level” changes. This is why I support your idea of separating matchmaking rank (or any system that wants to predict outcomes) from user-displayed rank.

This comes back to the point I was trying to illustrate here. We’re essentially handicapping a predictive model by requiring it to perform predictions based solely on whether a single scalar value assigned to one player is higher or lower than a single scalar value assigned to another player. This stratification may not even be possible in many cases if we add reasonable expectations of a ranking system (like “players of rank x should win against each other around 50% of the time”).

Once again, this doesn’t mean there isn’t a balance, just that we have to be careful.

Forgive me if this is a misinterpretation of your words, but this sounds to me a lot like it’s referring to a sort of “maximal accuracy of a biased system”. My argument is that if one takes proper precautions to ensure a model’s accuracy is measured on data it was not trained on, you also get to see how poorly highly variant models perform, even if they inevitably do better on training data. The key is to realize that a rating moving on the pure variance of random outcomes inevitably leads to erroneous predictions; if this happens more often than a biased system loses accuracy relative to perfection, then the variant system will similarly perform poorly.

Now, with rating systems there is also the “test vs. production” issue, but we have attempted to remedy this somewhat by obtaining multiple datasets from different associations with different matchmaking and rating systems.

1 Like

A quick fix for perceived volatility I once suggested would be to visually show the current “8.2k ± 1.4” format on profile pages as “9.6k – 6.8k”,
and to expand the rank after usernames to “name [10k – 7k]”.
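
As a rough sketch of the formatting (assuming a pure-kyu rank; the dan crossover would need extra handling):

```python
def kyu_range(mid_kyu: float, dev: float) -> str:
    """Render an '8.2k ± 1.4' style rank as a '9.6k – 6.8k' range.

    On the kyu scale a larger number means a weaker player, so the weak
    end of the range is mid + dev and the strong end is mid - dev.
    """
    return f"{mid_kyu + dev:.1f}k – {mid_kyu - dev:.1f}k"

print(kyu_range(8.2, 1.4))  # -> "9.6k – 6.8k"
```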

A dash makes it easier to understand that the OGS ranking system is spitting out ranges where it thinks the rank should be.

I feel like the main reason users think OGS ranks are volatile is that they only focus on that single number in the middle of their range (duh, that’s what’s being shown) instead of looking at the full picture. A rank which jumps constantly between 8k and 10k feels really volatile, but a rank which hovers between “10.8k – 7.2k” and “8.4k – 6.6k” looks a lot more stable.

I’ve actually suggested this a few times before, but I guess nobody really liked the idea, as it doubles the amount of numbers on screen. And I agree that it would probably look like clutter >__>

4 Likes

I don’t think so, sounds like a nice, concise way of putting it. I’m saying that a more stable rank is probably a biased estimate of a more nebulous and fluctuating idea of a “true rank” – at least as it pertains to predicting game outcomes. The only thing I want to emphasize is the key word “very” in this quote:

I’m using “very” to mean “extremely”. As you’re saying, I definitely think a decently accurate model can be made with a potentially biased, low-volatility model.

I realize that whether or not my hypothetical high-volatility, very accurate system even exists is up for debate too! I just wanted to caution against going with one if we do find it, since it will of course increase rank volatility. This is just another way of pointing out that decreasing volatility and improving matchmaking by way of predicting outcomes could be conflicting goals that need to be balanced.

Once again a very good way of cutting to the heart of this. I believe that the accuracy loss of the biased system will outweigh that of a less biased (more volatile) one, but that’s just my speculation.

One last point to clarify:

This is another source of bias I’m trying to point out (which may or may not be the one you were referring to). In other words, given a set of players and game records and completely discounting evolution over time and generalizability to other data sets, it’s perfectly possible that no assignment of ranks to each player will accurately capture expected outcomes, even within that “training” set. I.e. it might be impossible to construct a hierarchy such that given any player A with a rank higher than player B, player A’s win rate against player B is higher than 50% (or however you want to quantify it).
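
As a toy illustration (with made-up numbers), three players with cyclic win rates already defeat any possible scalar hierarchy:

```python
from itertools import permutations

# Hypothetical win rates: winrate[(p, q)] is p's win rate against q.
# A beats B, B beats C, and C beats A, each 60% of the time.
winrate = {("A", "B"): 0.6, ("B", "C"): 0.6, ("C", "A"): 0.6,
           ("B", "A"): 0.4, ("C", "B"): 0.4, ("A", "C"): 0.4}

# Try every strict ranking (strongest first) and check the property
# "higher-ranked player wins more than 50% of the time".
for order in permutations("ABC"):
    ok = all(winrate[(hi, lo)] > 0.5
             for i, hi in enumerate(order) for lo in order[i + 1 :])
    print(order, "consistent" if ok else "violated")
# Every ordering prints "violated": no assignment of scalar ranks works.
```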

This could also lead to volatility – maybe the “best” hierarchy in terms of explaining challenge outcomes completely sticks player A in the “wrong” rank. A future update of the model might have more games involving player A, and it could happen that improving the accuracy of player A’s rank is a good way of improving overall accuracy of the predictions. Now player A will experience a lot of volatility in their rank, not because they changed skill or had any upsets or events of randomness but just because the model is trying to optimize for predictive accuracy in a biased manner. Again, just another caution against trying to maximize predictive ability. (Of course, a lot depends on the implementation specifics of any ranking system.)

Fully willing to be proven wrong on this one! We’ll just have to test and see.

2 Likes

There might be some evidence with which to assess this properly, if I once again bring in the diagram from Coulom’s paper on WHR, in which Coulom notes a drop in performance between training and test data that “probably cannot be explained by overfitting alone.” While I disagree with this assessment in general, it is notable that Bayeselo is a more biased system than Elo or Glicko, as it assumes an unmoving “true rating” (it was designed for rating AIs). Coulom further notes that “A remarkable aspect of these results is that parameters that optimize prediction rate give a very low variance to the Wiener process. The static Bayeselo algorithm even outperformed incremental algorithms on the test set.”

It is notable that the paper doesn’t compare Glicko-2, which is a more variant system than Glicko-1, and that WHR itself uses a similar set of assumptions to the Glicko-1 system. It’s also notable that Glicko-1, Elo, and TrueSkill, while being the least accurate of the set in general, also lost the most accuracy between the training and test data (Elo losing the most, despite having few parameters to tune), possibly a result of not being as responsive to rating changes in the test data (which could be related to all of these algorithms giving little merit to the random movement of true ratings, as also noted by Coulom).

So it’s quite possible that more “volatile” systems perform better (WHR, after all, even changes previous ratings), but not necessarily systems with a large parameter space.

2 Likes

I’ll have to look into this paper more when I have time! I must have missed it the first time it came up in discussion.

I am interested in hearing your thoughts on the second kind of bias I described (quoted again inside)

I’m wondering if this might contribute to the relatively low accuracy of all of the methods, but after staring at this for a while, it’s always good to get a fresh set of eyes, as I could be missing something obvious. It might be interesting to take a smaller set of data, strip the existing ranks, try to manually assign ranks that capture the outcomes in the games represented, and then compare those to the OGS ranks of those players.

Intransitivity… is a difficult topic… and I’m not particularly fond of most methods I’ve heard of or come up with for trying to handle it, as they either rely on indicators that could easily “cease to be a good measure” once they are implemented as a quantity of rank (such as rated go problems), information we don’t know how to model (such as the score of a game), abstract matrix operations (I once saw a system which used an interesting matrix manipulation to indicate “leadership” in legislative bodies of the US government), or starting out with random biases in certain directions and seeing what works out.

I mean, the most obvious explanation for the low accuracy rates is that the KGS data is not random-pairing data: the server actively works to create matchings with win rates close to 50% (even human populations without algorithmic matchmaking tend to approximate this, since people dislike large skill gaps, which often make for miserable or boring games), and it is largely the purpose of many rating systems to serve as a shortcut for finding such “interesting” games.

4 Likes

A very good point. I’ve been under the assumption that part of the desire is to create a ranking system which is very descriptive of individual game outcomes between any given pair of players with different, integral ranks (especially for the purpose of handicaps).

Looking back on some past comments, I see that this has been discussed a good amount in trying to generate a metric for evaluating predictive performance by way of average win rates or entropy, rather than a simple “what percentage of game outcomes does this model or system accurately describe”. This definitely has the potential to alleviate many of my worries about trying to create a model that focuses too much on capturing things like upsets, luck, and human variability. (Edit: but some of the proposed metrics do still have the potential to bring about these problems)

Worth noting that the paper above uses yet another metric:

“The method to compare algorithms consisted in measuring their prediction rates over a database of games. The prediction rate is the proportion of games whose most likely winner was predicted correctly (when two players have the same rating, this counts as 0.5 correct prediction).”
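
In code, that metric is essentially the following (a sketch with illustrative names):

```python
def prediction_rate(games, rating):
    """Fraction of games whose most likely winner (the higher-rated
    player) actually won; equal ratings count as 0.5 of a correct
    prediction. `games` is a list of (winner, loser) pairs and
    `rating` maps each player to a scalar rating."""
    score = 0.0
    for winner, loser in games:
        if rating[winner] > rating[loser]:
            score += 1.0
        elif rating[winner] == rating[loser]:
            score += 0.5
    return score / len(games)
```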

This is in the family of metrics I was envisioning when making many of my points, and is most likely to cause the issues I wanted to highlight (though I realize it may not be the one @esporajam has in mind). It also highlights another of my implicit assumptions: that the system will not be judged on whether it can predict outcomes within a single rank or between players of very close ratings (which is supposed to be a coin toss anyway) – other than the fact that it should maintain a roughly even win rate among such players. In my mind, I was picturing an evaluation of predictive ability based mostly on games between players of differing ranks. It makes much more sense that the models would have closer to 50% accuracy if many of the games in the test set are between players of close rank. I realize that the paper’s metric does give half credit, but only in cases where the ratings are actually identical, whereas I was imagining little or no penalty even if the ranks are within a certain margin.

Many of my points are either very relevant or not relevant at all depending on which metric is chosen, and I apologize for not realizing and not bringing up my (very likely mistaken) assumptions earlier. I came into this thread late and from a different thread, and should have taken the time to fully read past discussion here and in other threads.

2 Likes

Also worth noting that the higher performance of the more volatile model in that paper could be related to the fact that the paper uses precisely the type of metric I was envisioning when stating that chasing predictive accuracy might favor volatile models. Yet another thing worth considering and testing.

1 Like

Completely fair; I’m not a huge fan of the metric used either, as a sole indicator.

1 Like

I want to put forth a disclaimer: I’m aware that, even though I probably have more mathematical knowledge than a majority of human beings, I am essentially a layman in the context of this conversation, and trying to participate with my ideas kinda makes me a crackpot.

But I have a few things I’d like to hear your opinions on.



 Part 1: Analogy to the uncertainty principle for waves

TL;DR: in order to understand a wave-like function, looking at a point sample (or a narrow interval) is too little info. We need to look at the fluctuation itself to get that info.

This is inspired by this popular video. If I understood it and recall it correctly, one takeaway is that when you have a wave, say as a function of time, the less time you spend observing it, the less you can be sure of its properties (in that case wavelength).

 In our case, I believe that the “true rating” acts as a noisy wave fluctuating around a fairly stable curve over time, and for any attempt to “sample” it, there’s a lot of random noise that causes our measure to be inaccurate, which you could think of as another noisy wave being added to the “true rating”.

 What we want to do is not study the frequency spectrum of the wave, but just obtain enough experimental data to be able to know the shape of the wave as accurately as possible – but intuitively, the analogy rings true:

we need to look at more of the wave if we wish to understand it. Any single sample could be on the lower side of the wave, the upper side, or in the middle.

This brings me to two considerations:

  1. Even if what we wanted was just to “see through the noise” caused by experimental errors, intuitively we would need to try to guess where the center of the wave is, smoothing out the noise.

  2. Even if we could build a good picture of the wave, we would never know which of the fluctuations we observe are noise caused by the limits of the sampling system, or fluctuations in the actual “true rating”.

 Which leads me, intuitively, to the conclusion that we should not try to follow the high-frequency fluctuations at all, and that an approach where we try to keep the rating estimate stable at our best guess of the current “center of gravitation of the wave” is just more likely to be, on average, a good estimate of the “true rating”.



 Part 2: Why I’m skeptical of instant ratings

TL;DR: for the above reason, I believe they're bound to have a low signal-to-noise ratio.

To be clear, what I said in the last paragraph is what I had been thinking the whole time, at least since I wrote the “schmating” thing. So if you’re thinking “That’s what I’ve been saying this whole time!”, well, it just means you somehow didn’t realize I already agreed with this :laughing:

I feel this might be another application of the bias-variance tradeoff principle (in fact, it seems to be essentially what @joooom has said multiple times): in such a system, we would be sacrificing the hope of ever accurately capturing the fluctuations of the “true rating”, but we would have less risk of being oversensitive to the noise caused by sources of experimental error (such as the intransitivity of players’ abilities).

  •  This is the reason I’m highly skeptical of instant ratings: a rating system that updates at every game, with no memory of the shape of the apparent rating fluctuations, feels to me like it’s doomed to either be oversensitive to the noise, likely leading to pointless swirling around until it gets so stupidly far from the actual “true rating” that it’s forced to move back, or to be too slow to notice any big-picture trends in the rating.

And so in the end this is why I have a feeling that reducing volatility might actually end up improving the overall “goodness” of the system.

Then again, as was just pointed out, different metrics of “goodness” might give dramatically different “evaluations” to the same rating system.

For example, going back to what stone_defender said, it might be a better user experience for the system to catch on quickly to a player being in a slump, to help them lose fewer games in the moment – even if the only way to have that quick reactivity is for the system to also latch on to random noise?


Well, that’s it for now. I had other interesting thoughts, but I need time to organize them in my head, and this message is already long enough :sweat_smile:

I would like to note that a system designed with the intent of being used on a per-game basis should similarly account for the variance of individual games. Glicko and Glicko-2 do this to some extent, but they are also approximations of a more computation-intensive model that Glickman devised (whose assumptions are actually very similar to those used in WHR).

The issue with comparing it to wave forms is that there is (by the assumptions of the model) no set frequency of undulation: the successive movements of Brownian motion are uncorrelated with each other, either positively or negatively. It’s not really an issue of sampling space. If we could get a player to play infinite games in one moment of time, we could theoretically determine their true rating at that moment with infinite precision (by the law of large numbers), yet gain no information on how far or in what direction it will move afterwards (it’s a random walk). And you would have a better guess on every single day thereafter than you did before gaining that information, since you start with a smaller deviation and the random walk always averages to be centered on wherever it currently is.

The main advantage of game-by-game rating systems is that no matter how variant the system is, each data point provides some amount of indication of the true rating (even if this information is relatively small on a per-game basis).

The disadvantage has to do with the computational complexity of making a good approximation of the underlying model with this data (Glicko and Glicko-2 work best as approximations with updates of ~10–15 games, for mathematical reasons having to do with the Bayesian updates and normal distributions), and the proper model often resorts to Monte Carlo integration even for smaller population sizes, as standard integration is often too expensive.

1 Like

As I said, we don’t care about the frequency spectrum, we care (or I believe we should care, and that’s what I was talking about) about the center of undulation.

If Brownian motion describes a Gaussian bell over time, that means it’s moving around a mean, and it means that if you look at it from afar instead of looking at one point, you should be able to get much more information about where that mean is.

If I’m wrong in assuming this property of Brownian motion, then I would very strongly start to believe that it’s a bad assumption to model the fluctuations of a player’s “true rating” as Brownian motion.

It’s not “moving around” a bell. Think about it like this: one way to approximate Brownian motion in one dimension is a random walk – flip a coin; if it’s heads, move +1, if tails, move −1. This creates a binomial distribution at each time step which yes, on average, will stay in the middle more often than head out to the sides. However, once it has moved to the +1 position, it will now on average stay around +1, because it’s the same uncorrelated process, just with a new starting position.
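
If it helps, here’s a quick simulation of exactly that coin-flip walk, conditioning on where it is at step 100 and checking where it sits on average at step 200:

```python
import random

random.seed(0)
paths = []
for _ in range(20_000):
    pos, path = 0, []
    for _ in range(200):
        pos += 1 if random.random() < 0.5 else -1
        path.append(pos)
    paths.append(path)

# Average position at step 200, given the position at step 100:
for level in (-10, 0, 10):
    later = [p[-1] for p in paths if p[99] == level]
    print(level, round(sum(later) / len(later), 2))
# Each average comes out near `level`, not near 0: the walk re-centers
# on wherever it currently is, with no pull back toward its start.
```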

Do not confuse this with the randomness used in the E function, which, while also uncorrelated, centers around the mean of whatever the rating is, and that mean does not move until the rating does.

2 Likes

Ok, that makes sense, mathematically. (Edit: weird choice of words, but I still don’t know how to express what I meant :laughing:)

Hm. I’ll take time to think about this. The discrete analogy is missing both the “experimental errors” and the fact that the only information we get from matches about rating is probabilistic in nature, so I’m unclear about how to think about it.

2 Likes
Clarification and summary of past points relating to predictive accuracy and volatility, and why we might want to shy away from models like WHR

I think it’s worth summarizing once again since there’s been a lot of back and forth, though I hope the intervening discussion has been at least partially productive rather than just frustrating :slight_smile: Please let me know if there are still any points of contention, mistaken assumptions, or outright incorrectness in the following:

An imaginary, black-box rating system that’s exceptionally (not just decently, but exceptionally) good at predicting outcomes on a given set of games will probably be uselessly volatile.

Part of this volatility may come from overfitting, which may be ameliorated in a number of ways.

Another part of this volatility may come from the fact that it’s just hard to accurately describe the chaos of human games and uneven interactions simply by comparing a single, relatively stable number per player. However, it depends on how you define “accurately”, with the more concerning definition being “percentage of games in which the player of a higher rank won”, or metrics derived from similar concepts. Many of @esporajam’s and others’ ideas for measuring accuracy are not of this type, but it’s worth noting, especially given that the WHR paper did actually use the former type of metric.

This does not mean that a model which favors relatively stable rank can’t be relatively good at making predictions (however you quantify “good”), just that trying to push predictive accuracy too far may be counterproductive. It also doesn’t mean that a highly-accurate predictive model actually exists either. I just suspect that if it did exist, it would be highly volatile in the general case.

Another key dichotomy is between systems that are explicitly implemented to describe win outcomes (like WHR) and systems that are built around concepts related to game outcomes/wins but not solely based on modeling observed outcomes (like Glicko). The latter type may also do a decent job of describing win outcomes, but the former is again more what I am concerned about in terms of introducing unwanted volatility. That said, I realize that my assumption that the former kind of model was even under consideration at all may be completely wrong! (sorry for so many formers and latters…)

Models also differ in whether they are allowed to update all ranks when new data is available, or just the ranks involved in the new game records. The former case is largely what I’m concerned about, again because this kind of freedom is likely to result in volatility if the metric for “success” is a simplistic calculation of predictive success. There is also the consideration of how often rank updates should occur.

All of these implementation details may result in widely varying behaviors, and overall I just want to note that if the wrong metric is used to evaluate them, a model that looks really good on paper (in a predictive sense) may be not at all the desired ranking system for OGS.

A specific example of some of these concepts, as they relate to the paper on WHR (and a question for @meili_yinhua)

The model in the WHR paper is of the type that my concerns are most relevant to:

  • It is explicitly designed to explain observed game outcomes in that it provides a maximum a posteriori estimate of the rank of each player, given a set of games
  • In its pure form (ignoring the paper’s proposal of less accurate but more incremental updates), it calculates this estimate using all games across all players
  • Its metric for success is basically just the percentage of game outcomes correctly predicted by the assigned ranks

Of particular note in this model, it’s possible that the MAP rank estimates at a given time, while being the best in a predictive sense, may overlook or misclassify certain players in the quest for global optimality. A future MAP estimate recalculated on new games (which may involve those players more often) may suddenly focus more on those players, altering their ranks more than one might expect. The authors do acknowledge this possibility, and specifically construct their prior distribution “to avoid huge jumps in ratings”, but we would have to test what the MAP estimation process does to a specific player’s rank over time, especially on a chaotic server like OGS with new players, long correspondence games, large breaks in playing, etc. @meili_yinhua may be able to give a better idea of the extent of this problem and whether or not there exists any kind of upper bound on rank changes during the estimation process, as I’m still not fully familiar with the WHR method. If I’m interpreting correctly, there is an upper bound on this type of volatility, and in fact it is controllable through the w parameter of the Wiener process used as a prior. However, the question remains whether a w low enough for our idea of “acceptable” volatility still allows WHR to provide any significant gain in predictive accuracy over other types of models.
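
For concreteness, my understanding of the role of w is the following (units purely illustrative):

```python
import math

def whr_drift_sigma(w: float, dt: float) -> float:
    """Std. deviation of the WHR Wiener-process prior on rating drift:
    r(t2) - r(t1) ~ N(0, w**2 * |t2 - t1|). A smaller w means a tighter
    prior, hence less volatility but also less responsiveness."""
    return w * math.sqrt(dt)

# A rating is allowed roughly w points of drift after one time unit and
# 10*w after a hundred: whr_drift_sigma(w, 100) == 10 * w.
```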

Some potentially inescapable volatility

This again goes back to the idea of “pools” of players that mostly play within themselves. It is possible that a system will assign stable, predictive intra-pool ranks. However, inter-pool interactions may then cause inescapable volatility as the relative ranks of people in both pools are adjusted.

Anecdotally, I’ve noticed on occasion that I’ll play a new (to me) player in a tournament and find that they “feel” a bit stronger or weaker than I’d expect, even though their rank uncertainty is relatively low. After the tournament concludes, I’ll see that player’s rank stabilize to a value closer to what I might expect. This is again totally anecdotal, but it does provide potential evidence for the idea of apparent volatility stemming from different pools of players suddenly interacting and needing to stabilize relative ranks.

The WHR paper also considers this:

“For instance, if two players, A and B, enter the rating system at the same time and play many games against each other, and none against established opponents, then their relative strength will be correctly estimated, but not their strength with respect to the other players. If player A then plays against established opponents, and its rating changes, then the rating of player B should change too.”

Again, however, I believe their solution may not be what we want. In a model more similar to the current one, it might be worthwhile to consider not only rank uncertainty, but also rank uncertainty with respect to the specific players involved in a given pairing.

I will try to look at @esporajam’s post and all that came after in more detail too! I just felt it worthwhile to try and summarize and move forward with past discussion topics first.

1 Like

If you’re referring to “the ability of a player’s estimated rating to jump around” rather than their true rating jumping around, that comes down to the Ratings Deviation. And the thing is… mathematically we have no model to justify an RD cap; we just often assume one at the “unknown player” deviation, and WHR does not even explicitly state one (although I’d imagine most programs implementing it have one, just for error prevention or user-friendliness). Theoretically, given infinite time without games, the deviation can increase indefinitely, growing with the square root of time.

This is a problem with any rating system that does not account for the ratings of players who play each other being correlated (such as the Glicko approximations): when two playing populations meet for the first time, the system expects their average ratings to be equal, and the “diffusion” (for lack of a better term) of rating between them, until each population settles at its correct relative level, takes that much longer.
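
For reference, in Glicko this square-root growth is the onset-of-rating-period update; a sketch (c is a per-system tuned constant, 350 the conventional “unknown player” deviation):

```python
import math

def rd_after_inactivity(rd: float, t: float, c: float, cap: float = 350.0) -> float:
    """Glicko's deviation growth over t rating periods without games:
    RD' = sqrt(RD^2 + c^2 * t), conventionally capped at the 'unknown
    player' deviation of 350. The variance grows linearly in time, so
    the deviation grows with the square root of time."""
    return min(math.sqrt(rd * rd + c * c * t), cap)
```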

1 Like

The concept of measuring “true” rank through the percentage of the population that a given player is expected to win against seems solid. As you acknowledge, though, a lot rests on the existence of P(a,b,t), which might cause issues when trying to implement this in a real-world ranking system.
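
For reference, the measure being described is essentially the following (a sketch; `p_win` stands in for the P(a,b,t) above at a fixed time t):

```python
def population_rank(player, players, p_win):
    """Hypothetical 'true rank' measure: the fraction of the population
    the player is expected to beat, given a pairwise win-probability
    function p_win(a, b)."""
    others = [q for q in players if q != player]
    return sum(p_win(player, q) for q in others) / len(others)
```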

I can definitely agree that assigning ranks is not a one-dimensional problem, and when I’m talking about the bias-variance tradeoff, yes I am talking about the entire system as a whole. I’d just be careful about assigning a specific dimensionality to any problem without rigorously defining what the implementation is, what exactly is being optimized, and what exactly the descriptor “multi-dimensional” is being applied to.

The tradeoff between time and frequency resolution found in signal processing is definitely interesting, but as @meili_yinhua mentions, I’d be careful about drawing more than superficial parallels between that and the current rank problem. DSP and Fourier transforms are actually things I spend a lot of time with, though, so I do appreciate the desire to apply those concepts. I also agree with your conclusion that it may be worthwhile to try to ignore the more “noisy” fluctuations in rank. This could be as simple as applying an averaging filter to the current rank graph, as I believe you have already noted.
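
As a sketch of what I mean by an averaging filter (the window size is arbitrary):

```python
def smoothed_rank_graph(ranks, window=15):
    """Trailing moving average over a player's rank history: a purely
    cosmetic de-noising of the displayed graph, leaving the underlying
    rating system untouched."""
    out = []
    for i in range(len(ranks)):
        lo = max(0, i - window + 1)
        out.append(sum(ranks[lo : i + 1]) / (i + 1 - lo))
    return out
```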

2 Likes

In theory this is the same as what the E functions found in both Elo and Glicko do, but no model that I currently know of accounts for the possibility that performances in games close together in time are correlated with each other without being correlated to a “true rating”.
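
For anyone following along, the E function here is the usual expected-score curve; Elo’s version is the classic logistic (Glicko’s has the same shape but attenuates the rating difference by a g(RD) factor to account for the opponent’s rating uncertainty):

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Elo's E function: the expected score of player A against
    player B, on the standard 400-point logistic scale."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# elo_expected_score(1500, 1500) -> 0.5
# elo_expected_score(1700, 1500) -> ~0.76
```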