Testing the Volatility: Summary

So, there was a previous thread mostly about this, but it started out forked from another one and got very long and technical, so I expect not many people have read it or have a good idea of what it was about.

I thought I would sum up the salient points here and keep going from there, so that people who don’t have much time can know where we’re at and have the opportunity to maybe jump in with topical suggestions.

And since I expect this thread to get quite long-winded and technical too, I plan to keep doing this periodically. At some point I might set up a meta-thread to link all of them together and sum them up, but for now I believe this is enough. Or maybe it’s simpler to periodically edit a summary of the recent developments into this first post, maybe make it a wiki.


First, a quick explanation of what we mean by “volatility”:

For those not in the know: when we talk about volatility, we mean the “wild” swings in rating that affect some players:
[rating graph: 10 kyu, 14 kyu, 15 kyu?, 11 kyu; what level is this player?]

This, combined with the cultural perception that a player’s “true strength” is pretty stable over time, and with the fact that correspondence games affect the rating with a “delay”, leads many to speculate that these swings just aren’t accurate reflections of the player’s strength at any point in time.


So then, the discussion arose from @snakesss wondering how the handicap system can keep a reasonable winrate given the likely unreasonable volatility we have, and from @gennan and me discussing it. To this, there are at least two possible explanations:

  1. in the cumulative winrate, excessive wins and excessive losses that shouldn’t happen balance each other out, leading to a misleadingly “good” winrate.
  2. the volatility in the rating system is actually succeeding in keeping track of when players are playing better or worse on a game-by-game basis, and thus the rating a player has when playing a specific game is usually correct.

I believe we all suspect (1) to be a much more likely explanation, but that’s not how scientific knowledge works. We need to perform an actual test to know for sure, by Jove!

But how do we actually test it?


@Jon_Ko pointed this out: since the rating system we use is based on Glicko-2, and basically works by predicting a user’s probability of winning against another user (based only on their rating info) and then slightly adjusting the ratings according to how much the result differs from the prediction, we can “ask” it what its prediction is and check how accurate those predictions are game by game.
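For reference, the “prediction” in question is the expected score E from the Glicko-2 specification. Here’s a minimal sketch of it, with all values on the internal Glicko-2 scale (so not directly OGS’s displayed numbers):

```python
import math

def g(phi):
    # Glicko-2 weighting factor: discounts the rating difference
    # according to the opponent's rating deviation phi
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(mu, mu_opp, phi_opp):
    # expected win probability E from Glickman's Glicko-2 paper
    return 1.0 / (1.0 + math.exp(-g(phi_opp) * (mu - mu_opp)))
```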

He, @Allerleirauh, @paisley and I made proposals on how to quantify that accuracy. In the end, it turned out almost all of the proposals were equivalent: use “binomial deviance” to measure the accuracy of the predictions.

@meili_yinhua provided a very good explanation of binomial deviance for dummies like me:

[explanation collapsed in the original post]
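To make it concrete, here’s a minimal sketch of the measure itself (how we’d extract the predictions from the rating history is the hard part, and a separate question):

```python
import math

def mean_binomial_deviance(predictions, outcomes, eps=1e-12):
    # average of -[y*log(p) + (1-y)*log(1-p)] over all games, where
    # p is the predicted win probability for (say) Black, and y is 1
    # if Black actually won, else 0; lower is better, and always
    # predicting 50% scores log(2) ~ 0.693
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(predictions)
```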

Other proposals were one by me, to group up all the games where the rating system predicted a similar winrate (say, 55% for Black), and for each of these groups count the actual percentage of wins and compare it to the prediction; and one by @Jon_Ko, to use the mean squared error.
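Sketches of those two as well, under the same assumptions as the deviance snippet above:

```python
from collections import defaultdict

def mean_squared_error(predictions, outcomes):
    # Brier-style score: squared gap between the predicted
    # probability and the 0/1 result, averaged over all games
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

def calibration_table(predictions, outcomes, bucket_width=0.05):
    # group games by predicted winrate (e.g. everything around 55%
    # for Black), then compare each group's average prediction with
    # the win frequency actually observed in that group
    buckets = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        buckets[round(p / bucket_width) * bucket_width].append((p, y))
    rows = []
    for center in sorted(buckets):
        pairs = buckets[center]
        avg_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        rows.append((center, len(pairs), avg_pred, observed))
    return rows
```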

Technically there was also a proposal, by more than one person, to compare the ratings of players who also have ratings on other sites or in other associations, but there would be no reason to assume one rating system is better or worse than the other(s), so that would just delay the inevitable need for a way to quantify the accuracy of each rating system. Still, if we could gather the data needed to do this, it would certainly be cool to perform the comparison.



Well, other than a few digressions and discussions about how the Glicko-2 rating system works, that’s the gist of it for now!

Now, here’s a list of aspects that I think we should discuss one by one, but I’ll write all of them together so that I don’t forget (I’ll edit more in if I think of more or if I get suggestions):

  • We need to make sure we’re actually testing our current rating system and not a “ghost” of it.
  • Even knowing what mathematical function to use to evaluate it, we need a “control” to measure it against, or I believe the measure is mostly meaningless.
  • In fact, do we actually need to implement something like my virtual timeline framework in order to do that?
  • I speculate that correspondence games probably have a much worse effect on one’s rating inaccuracy, so I think we should test them separately.
  • Since we’re specifically testing whether the volatility is bad, would it make sense to put special attention on the games that happened when the rating was “far from the average” (of the current player’s strength)?

I think the first point is the most urgent to discuss, so I’ll write the first reply focusing on that (which is why I deliberately left it vague).

4 Likes

So, the first thing I think we need to discuss:

We need to make sure we’re actually testing our current rating system and not a “ghost” of it.

Here’s the thing. First off, until we get an official confirmation that the goratings repository is the code currently used on the site, we don’t actually know exactly how it works; and even if we get that confirmation, until we understand the code thoroughly, we won’t know whether it actually works the way Glicko-2 should.

If anything, we already have some evidence that it doesn’t. (see here, then here, then the PS here)

While this is not necessarily a big issue for the rating system itself, as it works fine for the main practical purpose of a rating system (matchmaking), I believe it might be an issue for our purpose of testing it:

 Glicko-2 is based on a specific probabilistic model, which is built around predicting winning probabilities based on a player’s parameters and then updating those parameters based on how bad those predictions were. So Glicko-2 tries to “converge” to a rating that makes its win% predictions as good as possible.

 Thus, for Glicko-2 specifically, it may make sense, for example, to test different “fine-tunings” of it to see which of them has the best accuracy, because by doing that you’re measuring how good a “tuning” is at converging to that supposedly optimal point that the probabilistic model imagines and, well, models.

But if you start changing the structure of the system itself, the Glicko-2 probabilistic model doesn’t necessarily apply anymore.
 So if we follow the Glicko-2 specifications to calculate the expected win probability for a match, but we plug in OGS ratings instead, that might be a pointless exercise, as that number might be meaningless for the OGS rating system.
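Concretely, “plugging in OGS ratings” would mean something like the standard conversion from the Glicko-2 paper, and this is exactly where an assumption sneaks in: that OGS’s numbers live on the usual Glicko display scale.

```python
GLICKO2_SCALE = 173.7178  # conversion constant from Glickman's paper

def to_glicko2_scale(rating, deviation):
    # convert a display-scale rating/deviation pair to the internal
    # Glicko-2 scale; whether this is faithful to what OGS actually
    # computes is precisely the open question above
    mu = (rating - 1500.0) / GLICKO2_SCALE
    phi = deviation / GLICKO2_SCALE
    return mu, phi
```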

 I think I might be exaggerating here, as the OGS system clearly works in a way that’s pretty similar to how Glicko-2 should work. So it must be doing something in the right ballpark, at least.

 But for testing purposes, there’s at least one drawback of this, I think: we might inadvertently end up “strawmanning” the system, because the Glicko-2 equation for expected win probability might be a slightly inaccurate approximation of the number we want, which intuitively should be based on a probabilistic model fitting the OGS system instead.

 We want to test the system, but our hypothesis is that the system is bad, and we should probably give it the best chance to prove itself right.

(In a cosmic coincidence, we might also end up “steelmanning” the system instead of “strawmanning” it, and while I don’t think it’s worth worrying about, we should at least try to just be as accurate as we can.)

1 Like

The rating graphs on a player’s profile page show the rating and how it changes due to certain wins and losses. Those changes could be compared to how the open project on GitHub evaluates those same wins and losses. If they behave the same, we can investigate the live code by investigating the open code.

1 Like

Maybe completely irrelevant for this topic and almost certainly not scientific … just some associations I made.

Isn’t volatility in the eye of the beholder?
The way it is presented has a subjective feeling; it can lead you to different interpretations.
Diagrams 1 and 2 impress me as rather volatile, while diagram 3 looks definitely less volatile to me.

  1. The way it is normally shown on the OGS profile page
  2. Focusing on roughly the last month
  3. Roughly the last month with a stretched basis

Of course I realise that this line of reasoning is manipulative. It is after all the same data we are talking about.
Is there a measure (standard deviation?) that can be used to describe the volatility of an individual player and the whole OGS system of players?
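Something as crude as the standard deviation of the rating history could be a starting point (made-up numbers, and note that a smooth steady climb and a wild oscillation can produce the same value, so this is only a first stab):

```python
import statistics

def rating_volatility(rating_history):
    # sample standard deviation of a player's rating over some window
    return statistics.stdev(rating_history)

print(rating_volatility([1100, 950, 1150, 900, 1200, 1000]))
```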

Okay, just my minus 2 cents.

1 Like

@Atorrante, you know, your reply sent me down a rabbit hole of philosophical thoughts.

In the end it touches on something that I think needs, and deserves, to be discussed in a dedicated thread, really, something that I’ve been thinking about for a while.

So I’ll keep the long sketch of what I was writing as a reply until I can better organize those thoughts :laughing:

For now, here’s a little interesting tidbit that I’m not sure everybody is aware of; I personally only learned it in recent days while investigating this topic.

The following two pictures represent the rating of the same user (you) in the same interval of time (from November 21, 2022 to January 2, 2023), but in the two different “modes” the site has:

[the same rating graph in “game-based” mode and in “time-based” mode]

In “time-based” mode, the site seems to average the player’s rating over each day, and only displays one data point per day.

As expected, the averaging out makes for a far less shocking and perhaps more “reasonable” graph to our eyes, but it arguably hides the true extent of the volatility.

(The graphical display of the deviation – the dark blue bar – does something even weirder that I really do not understand)

I don’t know, you folks make of that what you will.

1 Like

The dark blue bar shows the user’s entire rating history. The highlighted bit (at the end in this case) is identical to the zoomed-in line graph shown above it.

1 Like

Excellent idea – though it seems to presuppose that there’s nobody we can trust to simply give us that answer in the first place :laughing:

But then again, we’d need to develop a framework to perform this kind of testing sooner or later, so I guess we might as well get to it.


I wasn’t talking about that part actually, I was talking about this:

Since in the other view the deviation is pretty much constant, I’m not sure what kind of calculation is producing this. Maybe it’s just some kind of Bezier-smoothing, but I have no idea.

Ohhh right… fascinating indeed! :slight_smile:

I believe OGS uses cubic spline interpolation to smooth the curve. The code for the time-based ratings chart can be found here.
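For anyone unfamiliar, here’s roughly what that kind of interpolation does, sketched with scipy and made-up numbers (so an illustration, not the actual frontend code):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# hypothetical daily rating averages, one point per day as in the
# time-based view
days = np.array([0, 1, 2, 3, 4, 5])
ratings = np.array([1510.0, 1480.0, 1530.0, 1495.0, 1550.0, 1520.0])

spline = CubicSpline(days, ratings)
dense = np.linspace(days.min(), days.max(), 200)
curve = spline(dense)  # passes exactly through the daily points,
                       # but bends smoothly between them
```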

3 Likes

[continued from points brought up in Edit: Turns out this topic is about amending the TOS or something, hop in 🤷]

Maybe taking a step back will help. How would a true rank be defined? I’m assuming it should be based solely on existing matches so we can try to build a probabilistic model and predict future results, but there are other methods (knowledge exams, etc.).

Let’s say I have a set P of all OGS players, and a relation R over P containing tuples representing every game played, where (x,y) in R indicates player x beating player y. I’ll abuse the notion of a relation to allow duplicate tuples to represent multiple games with the same result.

Given such a relation taken from OGS at some point in time, how would you define the “true” rank of each player based on the existing game record relation R?

For example, you might have this data (you can pretend there’s a lot more):

(GoLover, Player1)
(Player1, badukforever)
(Player1, badukforever)
(GoLover, atari_everything)
(atari_everything, noob_bot)
(Player1, noob_bot)
(badukforever, atari_everything)

Is it possible to stratify/rank these players in a consistent and useful manner? Maybe you could say each member of a given stratum should win around 50% of games played within that same stratum, and more often than that when playing lower ranks, but is it even possible to fit everyone into such a system consistently?
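(To make “fit everyone into such a system” concrete: the textbook way to squeeze a one-dimensional strength out of a relation like R is a Bradley-Terry fit. That’s not what OGS does, just a sketch, and even on this toy data the degenerate cases show up immediately:)

```python
from collections import Counter
from itertools import chain

# the toy relation R: (winner, loser), duplicates = repeated games
R = [
    ("GoLover", "Player1"),
    ("Player1", "badukforever"),
    ("Player1", "badukforever"),
    ("GoLover", "atari_everything"),
    ("atari_everything", "noob_bot"),
    ("Player1", "noob_bot"),
    ("badukforever", "atari_everything"),
]

def bradley_terry(games, iterations=200, prior=0.1):
    # Bradley-Terry model: P(i beats j) = p_i / (p_i + p_j), fitted
    # with the classic MM updates. The prior adds a few virtual games
    # against a fixed reference player of strength 1; without it an
    # undefeated player (GoLover here) diverges to infinity and a
    # winless one (noob_bot) collapses to zero.
    players = set(chain.from_iterable(games))
    wins = Counter(w for w, _ in games)
    pair_games = Counter(frozenset(g) for g in games)
    p = {pl: 1.0 for pl in players}
    for _ in range(iterations):
        new_p = {}
        for i in players:
            denom = 2.0 * prior / (p[i] + 1.0)  # virtual games
            for pair, n in pair_games.items():
                if i in pair:
                    j = next(x for x in pair if x != i)
                    denom += n / (p[i] + p[j])
            new_p[i] = (wins[i] + prior) / denom
        p = new_p
    return p

print(sorted(bradley_terry(R).items(), key=lambda kv: -kv[1]))
```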

Now, according to whatever definitions make up this imaginary true rank, what should be the expected outcome when Player1 challenges atari_everything? Can we assume any sort of transitivity? Further, what happens when this expectation is violated repeatedly in future games because there was no data about those two specific players challenging each other when the initial rank was constructed? Is the true rank wrong, or is volatility just a natural consequence?

Further, what happens when a new player joins? Do we make them play every other player until ranks stabilize? What about pools of people who mostly play each other that will inevitably form? What happens to the rankings when they play another player outside that pool?

Ok, I’ve been writing this for a long time, so keep in mind it’s based on your previous reply in the other thread.

I’ll expand on your idea to describe something I’ve been thinking of and hopefully circle back to the volatility question.

I don’t necessarily think this is the best way to make my final point, but it’s definitely the fun one for me.

So I present to you:

“Go Strength” is Complicated, and Why That Might Not Matter

 The latent space and how to study it

I conceptualize "Go strength" in a very similar way as you seem to, with the difference that I believe there is potentially a way to capture it mathematically.

It’s called a “latent space”, and in simple terms it’s a multidimensional vector space where each dimension is a parameter.

 I don’t believe we can really know what those parameters should be, but I’ll use something similar to the example you gave for the sake of having a concrete idea to hold on to: say one parameter is “reading ability”, another is “direction of play instinct”, another is “joseki knowledge”, and so on, say we have k parameters.

 Now, we have a population of players, and say we have a magical black box that can meaningfully measure those parameters for each player, thus assigning to each player a point in a k-dimensional space.

 We can visualize that as a cloud of points, like a scatter plot, but instead of just x and y, we have k dimensions. But since it’s probably just a blob, it doesn’t really matter.

[scatter plot: a blob-like cloud of points]

Now consider a mathematical function that is able to use all of those parameters to estimate, for each point, the probability that a player with those parameters will win against another player with some arbitrary parameters.

Thus, for each point in this k-dimensional space, we now have a function from the k-dimensional space to the real interval [0,1], which you can visualize as assigning a color to every point of the space.

[heatmap: a k-dimensional space colored by win probability]
Like this, but there’s a different picture for every point of the space

 Now let’s apply a simplistic idea: for each player/point, calculate the k-measure (the “k-dimensional volume”) of the subset of “all the players this player has more than 50% chance of winning against” (or technically of a continuous approximation of those sets of points, whatever), and divide that by the “volume” of the set of the entire population of players (continuous yadda yadda).

Now you have a simpler function, that assigns a number between [0,1] to each point (it might look similar to the fog of the second image though).

Take your finite population of players and compile a histogram based on those numbers.

Call the x coordinate of that histogram a “schmating” and repeat after me: the schmating is a meaningful quantity.

[histogram of schmatings across the population]
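(If you want to play with the construction, here’s a toy Monte Carlo version of it; the dimension, the population, and especially the win-probability function are all invented for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_players = 5, 2000

# each player is a random point of k skill parameters
players = rng.normal(size=(n_players, k))

# invented win model: P(a beats b) is a logistic function of a
# weighted difference of their parameter vectors, so "beats with
# >50% probability" just means having the higher score below
weights = rng.uniform(0.5, 1.5, size=k)
scores = players @ weights

# schmating: Monte Carlo estimate of the relative k-volume of the
# region a player beats more than half the time, measured inside a
# bounding box around the population
box_lo, box_hi = players.min(axis=0), players.max(axis=0)
reference = rng.uniform(box_lo, box_hi, size=(20000, k))
ref_scores = reference @ weights
schmating = np.array([(ref_scores < s).mean() for s in scores])

hist, edges = np.histogram(schmating, bins=20)  # the histogram above
```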

Now suppose I told you two players have the same schmating at a certain point in time.

They could be in different points in the cloud, and if you just knew where, you would be able to get a good estimate of the specific probability of the outcome of their match at that point in time.

 But without knowing that, and just knowing their schmating, you would have to calculate the probability by calculating some kind of weighted average of all the possible pairs of points in the cloud that have that schmating. And for every pair you consider, you don’t know which of the two each player is, so by symmetry I believe the result of the sum would be exactly 50%.

 Now if I told you two players have a different schmating, we don’t know how difficult it would be to calculate the expected probability of one winning, but I strongly expect that the probability of one randomly selected player winning against a fixed selected one would follow some kind of remarkably smooth sigmoid curve as a function of the difference in their schmating, touching 50% at 0.
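(Continuing the toy sketch above, you can at least eyeball that conjecture: bin random pairs of players by their schmating difference and average the model’s true win probability within each bin:)

```python
# random pairs of players, binned by schmating difference
idx_a = rng.integers(n_players, size=50000)
idx_b = rng.integers(n_players, size=50000)
diffs = schmating[idx_a] - schmating[idx_b]
true_p = 1.0 / (1.0 + np.exp(-(scores[idx_a] - scores[idx_b])))

bins = np.linspace(-1.0, 1.0, 21)
which = np.digitize(diffs, bins)
curve = [(bins[i - 1], true_p[which == i].mean())
         for i in range(1, len(bins)) if (which == i).any()]
# in this toy model, curve comes out as a smooth increasing shape
# passing through roughly 50% at a difference of 0
```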

But of course all of this is much too complicated to calculate in practice, so what use is that for us?

 A schmero comes to save the day

One peaceful morning, the protagonist of our story, Mr. Schmarpad Schmelo, notices a remarkable thing about this system while schmoking his pipe: you can actually perform experiments to measure the schmating even without having any information about the k-dimensional space.

 You can pair a bunch of players together, and with statistical tools estimate their probability of winning against all the other players in the population, and from that estimate the schmating. Let’s call this estimate of the schmating a “rating”.

Then Mr. Schmelo recognizes that there are many things affecting (1) the schmating itself and (2) the accuracy of our estimate for it, aka the rating.

 As a player learns new things or some of their skills increase, the schmating also increases. Also, changes to their mood or just a momentary spur of creative inspiration can arguably make the schmating fluctuate, not only game by game, but move by move.
 And also, obviously, if the whole population gets “better” at playing but one player doesn’t, that one player’s schmating will decrease.

But still, those fluctuations should oscillate around a center in any given period of time, and without more information, it makes sense that the meaningful measure you want is that center of gravitation, because on average it gives you the most accurate prediction of their winrate relative to the schmating distribution.

 A player can play uncharacteristically good moves in a game (here by “good” we mean in terms of a hypothetical system that can calculate, say, the points lost or gained for any move), but on average those will be balanced out by some uncharacteristically bad moves.
 And even if they don’t within a game, the games where they play more good moves will on average balance out with the games where they play more bad moves, in the long term.

 Since you can’t expect to predict those fluctuations, the measure you really want is of the “center of gravitation” around which the schmating moves: that should give you the best estimate for the expected winrate relative to the schmating distribution (or the rating distribution).

 Also, reasonably speaking, you can’t perform enough measurements to be sure of the schmating, and “which other players a player is matched against” heavily affects how much information you have about it, so if you rely too much on the apparent info given by a single game result, you will easily end up making the rating fluctuate wildly, while the schmating probably only fluctuates around a very smooth and stable line over time (except when the player improves suddenly, say by learning a new joseki, which will likely make their schmating ramp up very fast).

So you develop a statistical model that tries to be flexible and take all of these uncertainties into account, but ideally, the objective of the system will be to guess the current “center of gravitation” as stably as possible, or just generally to have the best, on average, estimate of the expected outcome of any game based on the knowledge of the rating only.
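(As a deliberately naive example of such a “center of gravitation” estimator, not a proposal: an exponential moving average over the raw game-by-game ratings.)

```python
def smoothed_rating(rating_history, alpha=0.1):
    # exponential moving average: smaller alpha means more stability
    # and a slower reaction to genuine improvement
    smooth = [rating_history[0]]
    for r in rating_history[1:]:
        smooth.append(alpha * r + (1.0 - alpha) * smooth[-1])
    return smooth
```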

 Then of course other people, such as Mr. Schmlickman, can try to build different measuring models, and then by performing experiments you could find out which one is the best, although the model needs to be suitable for that (so there needs to be a meaningful concept of “expected probability of winning” that you can calculate from the rating).

 So in conclusion:

I don’t think it really matters that the “true nature of the Go strength of a human” is a complicated multidimensional monster: in the end, it is possible to compress some of that information into a one-dimensional spectrum in a meaningful way that can also be measured, and in the end, I believe that the rating being stable and less susceptible to fluctuations makes for better winrate predictions, which is how you get better matchmaking – although this is a hypothesis that needs to and can be tested.

2 Likes

Or to compress that long thing into a few words: rating doesn’t have the conceit to measure “strength”; it bypasses that by measuring the probability of a player winning against another player, based on a simplified measuring system and statistics.

Any philosophical reflection about strength being undefinable is beside the practical point: you can modify or tweak a rating system to make it more or less accurate in a statistical sense.

I believe in the end stability of rating is desirable in that sense. We can perform experiments to test that idea.

Yes, this is a good way of thinking about it! A couple of notes:

  • The axes of a latent space are usually not the human-understandable features. Instead of using a function providing a value to each point in the feature space and looking at volumes as you mentioned, people will often instead map from the feature space into a latent space in which points are already clustered or arranged by the characteristic you’re interested in. For example, you’ve probably noted that someone with very low joseki knowledge but almost supernatural reading might be very far (in terms of Euclidean distance) from someone of the same “schmating” who has good joseki knowledge or whatever other features are involved. In order to resolve this, a mapping is learned which transforms the feature space into a latent space where these two points are automatically close to each other (but now the axes are no longer human-understandable).

  • A model based solely on statistical properties of the known win rates (even if it’s trying to approximate something like the “schmating”) will still run into the problems mentioned here (many of which you did make note of)

I do see what you’re saying, but please try to realize I am not ignoring or undervaluing your ideas; they might very well end up performing better than the current system if executed carefully. I am just trying to expand upon them, and say that any ranking system, even (actually especially) one based on observed probabilities that perfectly captures expected win rates at a given instant in time, will be subject to a good degree of volatility if you want it to maintain whatever “nice” statistical properties it has at that instant. If you are suggesting that one might favor stability of a ranking at the expense of accuracy for a given prediction, then I agree fully. In general though, it is important not to mask these problems behind added complexity – often, the problem will pop up again in a more subtle, unexpected manner.

We’re probably talking past each other here. Is it possible to make a rating system with lower volatility and higher predictive power than the current one? Most likely yes! I am just stating that the closer such a model gets to describing the win rate of past games perfectly, the more volatility it is likely to have when updated to account for new games.

Another interesting thing to keep in mind is that the “optics” are also important.

 I know of at least one person who doesn’t want to play on OGS because of their perception that the rank volatility makes playing here worse than playing on other servers with a more stable rank.

 Because I agree with thinking of Go strength in very fuzzy multidimensional terms, I personally think it doesn’t make much sense, but how many players is OGS losing because of this reputation?

 Another thing, perhaps more important, is the way many players get attached to their rank and would like it to be a part of their identity, and are forced to go through the emotional roller coaster of seeing their rank rising and falling and never knowing when the rising represents a “true improvement” (meaning “long term improvement”), and feeling frustrated whenever their rank falls, even though it usually doesn’t signify a long term, or “meaningful”, decrease of their abilities.

 And just generally, as we’ve talked about and as I said in the first post in this thread, there’s a cultural perception that the rank should be pretty stable, and a lot of people will just assume that OGS’s rank being volatile means it’s bad, whether or not that’s the case.

 For these reasons, I believe that even if it did turn out that the rating volatility makes for better matchmaking, it would be much better in terms of user experience to “separate” the rank from the rating, or in other words to keep using the rating for probability estimation purposes, but have a rank that, while calculated starting from the rating, is kept artificially stable so that people can use it as a “badge” and feel that it more accurately represents their “average strength” (and that of the potential opponents).

 And maybe allow people the option to use the stable rank for the matchmaking, because in the end as long as there aren’t systematic biases, changing up the matchmaking shouldn’t affect the rating system significantly.

 As long as there is no “feedback loop”, where the rank affects the rating, I don’t really see how this could cause problems.

 But in the end, if my (and most people’s) hypothesis is right, we could actually kill two birds with one stone: it might turn out that stabilizing ranks would make for better matchmaking and make everyone happy.



 So from looking around, this is the situation I see (edit: meaning this is the impression I got so far, and I’m aware it might be wrong): most people believe the volatility of OGS’s rank is bad. Mainly, the only people who say it’s not bad are the people who might have the power to make it go away (the people who have a good understanding of how the OGS code works).

 Without evidence, it’s hard to say who’s right. The people who have the power to put together that evidence are the same people who disagree with most everyone else.

 Whenever people say “we believe the volatility is bad”, the people “with the power” seem to reply “but we believe it’s good”, but don’t provide evidence.

I fully agree with everything you say above the horizontal bar.

I don’t believe volatility is inherently good or bad, it just depends on what you want your ranking system to achieve. Maybe that’s the fact that is leading to this impression:

(in general it seems you have found that users want a stable rank, developers lean toward a rank that frequently updates to describe new data)

One of the main points I’m trying to get across is as follows:

Based on past posts, I have come to the interpretation that you want to test whether a ranking system which has lower volatility can produce better predictions about game results. This is a great goal, and probably achievable within reason.

However, past posts here and in other threads have led me to believe that you will judge such a model by how well it describes outcomes on some set of OGS games. I am just saying that, for any number of reasons which I have detailed before (elusiveness of a stable “true” rank, the need to update as new players are added or new matches occur, and the bias-variance tradeoff), one should be careful judging such a model/ranking system on how well it performs on existing data – you will likely end up increasing volatility if you push this accuracy too far, which I believe is the opposite of what you want.

Edit: found the root cause of this disagreement – it is the same as the one where you did not like this comment:

This is not a slippery slope or overly extreme hypothetical, it is simply how models like this work. There is of course a balance (and some trivial counterexamples that don’t apply to most real-world scenarios), but in general, you cannot have a model that is 100% accurate without it being more volatile than a model which is less accurate but also less volatile.

1 Like

Well, more than anything else, if it turns out that a very simple stabilization of the rating produces better predictions, on past data, than the current algorithm, then to me that sounds like extremely strong evidence that the volatility is not helping the accuracy at all.

It doesn’t sound to me at all like that would be overfitting the past data.

– but you’re right that this is a point I hadn’t considered enough in the past. For example, if you started exploring the parameter space to tweak the rating system and measuring its performance on past data, I can see how that might lead to overfitting. Intuitively, it might be related to this.

I wonder to what extent that could be solved the same way it often is in machine learning: you train the fit on one subset of the data, and you test it on another subset.
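Sketched out, with `fit_rating_system` and `predict_win_prob` as hypothetical stand-ins for whatever replay machinery we’d have to build (and reusing the deviance function from the first post):

```python
def out_of_sample_deviance(games, fit_rating_system, predict_win_prob,
                           split=0.8):
    # chronological split: tune/fit the system on the earlier games
    # only, then score its predictions on later games it never saw
    games = sorted(games, key=lambda g: g.ended)  # assumes a timestamp
    cut = int(len(games) * split)
    train, test = games[:cut], games[cut:]
    system = fit_rating_system(train)
    predictions = [predict_win_prob(system, g) for g in test]
    outcomes = [1.0 if g.black_won else 0.0 for g in test]
    return mean_binomial_deviance(predictions, outcomes)
```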


Still, I’d like to see some evidence. One can appeal to whatever big theories they want, but until they perform an experiment that could falsify their predictions they’re mostly empty words.

My words – my belief that the current level of volatility is not useful – are empty too, and I’d like to perform a test able to falsify them.

I would of course also be happy to just see (convincing) evidence gathered by someone else. “Hey, we did perform a test like what you were talking about, here’s the results” – my impression is that the closest thing to this that has been done so far is the “lumping” winrate test, and discussing how that isn’t necessarily significant is how this thread has come about.

I’m not terribly worried about overfitting. I have only a little ML knowledge, but as I understand it the biggest sources of overfitting are too-high dimensionality and too little data.

In this case, there is only one dimension we’re looking to tune (Glicko volatility) and plenty of data (thousands of users/millions of games).

I guess the hard part is getting and processing the data. I know folks have figured out data dumps in the past, but I’ve never tried to do an analysis myself.