Testing the Volatility: Summary

@Atorrante, you know, your reply sent me down a rabbit hole of philosophical thoughts.

In the end it touches on something that I think needs, and deserves, to be discussed in a dedicated thread, really, something that I’ve been thinking about for a while.

So I’ll keep the long sketch of what I was writing as a reply until I can better organize those thoughts :laughing:

For now, here’s an interesting little tidbit that I’m not sure everybody is aware of, and that I personally only learned in recent days while investigating this topic.

The following two pictures represent the rating of the same user (you) over the same interval of time (November 21, 2022 to January 2, 2023), but in the two different “modes” the site has:

[image: the same rating history rendered in each of the two site modes]

In “time-based” mode, the site seems to average the player’s rating over an entire day, and only displays one data point for each day.

As expected, the averaging out makes for a far less shocking and perhaps more “reasonable” graph to our eyes, but it arguably hides the true extent of the volatility.
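To picture what that would mean mechanically, here’s a minimal sketch of per-day averaging, assuming the site keeps one timestamped rating sample per rated game (the data layout and numbers here are made up; I don’t know how OGS actually stores this):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical input: one (timestamp, rating) sample per rated game.
samples = [
    (datetime(2022, 11, 21, 9, 30), 1512.4),
    (datetime(2022, 11, 21, 21, 5), 1498.1),
    (datetime(2022, 11, 22, 14, 0), 1530.7),
]

# Group the samples by calendar day and average them, so that each day
# contributes a single data point to the time-based chart.
by_day = defaultdict(list)
for ts, rating in samples:
    by_day[ts.date()].append(rating)

daily_points = {day: sum(rs) / len(rs) for day, rs in sorted(by_day.items())}
print(daily_points)  # day 1 averages to 1505.25, day 2 stays 1530.7
```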

(The graphical display of the deviation – the dark blue bar – does something even weirder that I really do not understand)

I don’t know, you folks make of that what you will.


The dark blue bar shows the user’s entire rating history. The highlighted bit (at the end in this case) is identical to the zoomed-in line graph shown above it.


Excellent idea – though it seems to presuppose that there’s nobody we can trust to simply give us that answer in the first place :laughing:

But then again, we’d need to develop a framework to perform this kind of testing sooner or later, so I guess we might as well get to it.


I wasn’t actually talking about that part; I was talking about this:

Since in the other view the deviation is pretty much constant, I’m not sure what kind of calculation is producing this. Maybe it’s just some kind of Bézier smoothing, but I have no idea.

Ohhh right… fascinating indeed! :slight_smile:

I believe OGS uses cubic spline interpolation to smooth the curve. The code for the time-based ratings chart can be found here.
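For anyone curious what that looks like in practice: the actual OGS chart code is TypeScript, but here’s a Python/SciPy sketch of the same idea, a cubic spline passing through the daily data points while keeping the curve smooth (the numbers are invented):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Daily rating data points: x = day index, y = that day's average rating.
days = np.array([0, 1, 2, 3, 4, 5])
ratings = np.array([1500.0, 1512.0, 1490.0, 1505.0, 1520.0, 1515.0])

# A cubic spline passes exactly through every data point but joins them
# with smooth third-degree polynomials, which is why the time-based
# chart looks like one continuous curve rather than a jagged line.
spline = CubicSpline(days, ratings)

dense_x = np.linspace(days.min(), days.max(), 200)
smooth_curve = spline(dense_x)  # the values the chart would actually draw
```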


[continued from points brought up in “Edit: Turns out this topic is about amending the TOS or something, hop in 🤷”]

Maybe taking a step back will help. How would a true rank be defined? I’m assuming it should be based solely on existing matches so we can try to make a probabilistic model and predict future results, but there are other methods (knowledge exams, etc.).

Let’s say I have a set P of all OGS players, and a relation R over P containing tuples representing every game played, where (x,y) in R indicates player x beating player y. I’ll abuse the notion of a relation to allow duplicate tuples to represent multiple games with the same result.

Given such a relation taken from OGS at some point in time, how would you define the “true” rank of each player based on the existing game record relation R?

For example, you might have this data (you can pretend there’s a lot more):

(GoLover, Player1)
(Player1, badukforever)
(Player1, badukforever)
(GoLover, atari_everything)
(atari_everything, noob_bot)
(Player1, noob_bot)
(badukforever, atari_everything)

Is it possible to stratify/rank these players in a consistent and useful manner? Maybe you could say each member of a given stratum should win around 50% of games played within that same stratum, and more often than that when playing lower ranks, but is it even possible to fit everyone into such a system consistently?
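One standard way to attempt exactly this kind of stratification is a Bradley–Terry model, which assigns each player a positive strength s such that P(x beats y) = s_x / (s_x + s_y). Here’s a toy fit of that model to the example relation above (my own illustration, not anything OGS does; the pseudo-count is a hack to keep zero-win players like noob_bot from collapsing to zero strength):

```python
from collections import Counter

# The example relation R: (winner, loser); duplicates = multiple games.
games = [
    ("GoLover", "Player1"),
    ("Player1", "badukforever"),
    ("Player1", "badukforever"),
    ("GoLover", "atari_everything"),
    ("atari_everything", "noob_bot"),
    ("Player1", "noob_bot"),
    ("badukforever", "atari_everything"),
]

players = sorted({p for g in games for p in g})
wins = Counter(w for w, _ in games)
pair_counts = Counter(frozenset(g) for g in games)

# Standard Bradley-Terry MM iteration: s_i <- W_i / sum_j n_ij / (s_i + s_j).
EPS = 0.1  # pseudo-wins, so players with no wins keep a positive strength
s = {p: 1.0 for p in players}
for _ in range(500):
    s = {
        p: (wins[p] + EPS) / sum(
            n / (s[p] + s[q])
            for pair, n in pair_counts.items() if p in pair
            for q in pair if q != p
        )
        for p in players
    }
    total = sum(s.values())
    s = {p: v / total for p, v in s.items()}  # normalize away scale drift

for p in sorted(players, key=s.get, reverse=True):
    print(f"{p:>18}: strength {s[p]:.3f}")

# The model also forces an answer for pairs who never played each other:
print("P(Player1 beats atari_everything) =",
      s["Player1"] / (s["Player1"] + s["atari_everything"]))
```

Whether that forced, fully transitive answer deserves any trust is of course the whole question.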

Now, according to whatever definitions make up this imaginary true rank, what should be the expected outcome when Player1 challenges atari_everything? Can we assume any sort of transitivity? Further, what happens when this expectation is violated repeatedly in future games because there was no data about those two specific players challenging each other when the initial rank was constructed? Is the true rank wrong, or is volatility just a natural consequence?

Further, what happens when a new player joins? Do we make them play every other player until ranks stabilize? What about pools of people who mostly play each other that will inevitably form? What happens to the rankings when they play another player outside that pool?

Ok, I’ve been writing this for a long time, so keep in mind it’s based on your previous reply in the other thread.

I’ll expand on your idea to describe something I’ve been thinking of and hopefully circle back to the volatility question.

I don’t necessarily think this is the best way to make my final point, but it’s definitely the fun one for me.

So I present to you:

“Go Strength” is Complicated, and Why That Might Not Matter

 

 The latent space and how to study it

I conceptualize "Go strength" in a way very similar to how you seem to, with the difference that I believe there is potentially a way to capture it mathematically.

It’s called a “latent space”, and in simple terms it’s a multidimensional vector space where each dimension is a parameter.

I don’t believe we can really know what those parameters should be, but I’ll use something similar to the example you gave for the sake of having a concrete idea to hold on to: say one parameter is “reading ability”, another is “direction of play instinct”, another is “joseki knowledge”, and so on; say we have k parameters.

 Now, we have a population of players, and say we have a magical black box that can meaningfully measure those parameters for each player, thus assigning to each player a point in a k-dimensional space.

 We can visualize that as a cloud of points, like a scatter plot, but instead of just x and y, we have k dimensions. But since it’s probably just a blob, it doesn’t really matter.

[image: a cloud of player-points in the k-dimensional parameter space]

Now consider a mathematical function that is able to use all of those parameters to estimate, for each point, the probability that a player with those parameters will win against another player with some arbitrary parameters.

Thus we now have, for each point in this k-dimensional space, a function from the space to the real interval [0,1], which you can visualize as assigning a color to each point.

[image: the k-dimensional space shaded by win probability]
Like this, but there’s a different picture for every point of the space

 Now let’s apply a simplistic idea: for each player/point, calculate the k-measure (the “k-dimensional volume”) of the subset of “all the players this player has more than 50% chance of winning against” (or technically of a continuous approximation of those sets of points, whatever), and divide that by the “volume” of the set of the entire population of players (continuous yadda yadda).

Now you have a simpler function, one that assigns a number in [0,1] to each point (it might look similar to the fog of the second image, though).

Take your finite population of players and compile a histogram based on those numbers.

Call the x coordinate of that histogram a “schmating” and repeat after me: the schmating is a meaningful quantity.

[image: histogram of schmatings across the player population]
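None of this is observable in real life, but it’s easy to simulate. Here’s a toy Monte Carlo version of the whole construction, with a made-up logistic win-probability function standing in for the magical black box (every modeling choice here is an arbitrary assumption, chosen only to be smooth):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 5, 2000  # K skill parameters, N players

# Stand-in for the black box: every player is a point in K-dimensional space.
players = rng.normal(size=(N, K))

# Made-up win-probability function: logistic in a weighted difference
# of the two players' parameter vectors.
weights = rng.uniform(0.5, 1.5, size=K)

def win_prob(a, b):
    return 1.0 / (1.0 + np.exp(-(a - b) @ weights))

# Schmating: the fraction of the population a player beats with > 50%
# probability. Under this win_prob, "beats q with > 50%" reduces to
# having a larger weighted margin, so it's just the empirical CDF.
margins = players @ weights
schmating = np.array([(margins < m).mean() for m in margins])

# Empirical check of the "smooth sigmoid" claim: bin random pairs by
# schmating difference and look at the average win probability per bin.
i, j = rng.integers(N, size=(2, 100_000))
diffs = schmating[i] - schmating[j]
probs = win_prob(players[i], players[j])
for lo in np.arange(-1.0, 1.0, 0.2):
    mask = (diffs >= lo) & (diffs < lo + 0.2)
    if mask.any():
        print(f"schmating diff in [{lo:+.1f}, {lo + 0.2:+.1f}): "
              f"mean P(win) = {probs[mask].mean():.2f}")
```

In this toy version the binned curve does come out smooth and monotonic, passing through 50% at a difference of 0, for whatever that’s worth.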

Now suppose I told you two players have the same schmating at a certain point in time.

They could be at different points in the cloud, and if you just knew where, you would be able to get a good estimate of the specific probability of the outcome of their match at that point in time.

 But without knowing that, and just knowing their schmating, you would have to calculate the probability by calculating some kind of weighted average of all the possible pairs of points in the cloud that have that schmating. And for every pair you consider, you don’t know which of the two each player is, so by symmetry I believe the result of the sum would be exactly 50%.

 Now if I told you two players have a different schmating, we don’t know how difficult it would be to calculate the expected probability of one winning, but I strongly expect that the probability of one randomly selected player winning against a fixed selected one would follow some kind of remarkably smooth sigmoid curve as a function of the difference in their schmating, touching 50% at 0.

But of course all of this is much too complicated to calculate in practice, so what use is that for us?

 A schmero comes to save the day

One peaceful morning, the protagonist of our story, Mr. Schmarpad Schmelo, notices a remarkable thing about this system while schmoking his pipe: you can actually perform experiments to measure the schmating even without having any information about the k-dimensional space.

 You can pair a bunch of players together, and with statistical tools estimate their probability of winning against all the other players in the population, and from that estimate the schmating. Let’s call this estimate of the schmating a “rating”.
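The simplest concrete version of this is an Elo-style update loop, which never sees the k-dimensional space at all, only a stream of game results (a minimal sketch; the constants are the traditional arbitrary ones, not OGS’s):

```python
from collections import defaultdict

K_FACTOR = 32  # how strongly a single result moves the estimate

def expected(r_a, r_b):
    # Logistic guess at P(A beats B), computed from the two ratings alone.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(games, initial=1500.0):
    """games: iterable of (winner, loser) names; returns name -> rating."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in games:
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += K_FACTOR * (1 - e)  # surprise wins move more
        ratings[loser] -= K_FACTOR * (1 - e)
    return dict(ratings)

print(rate([("GoLover", "Player1"),
            ("Player1", "badukforever"),
            ("GoLover", "atari_everything")]))
```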

Then Mr. Schmelo recognizes that there are many things affecting (1) the schmating itself and (2) the accuracy of our estimate for it, aka the rating.

 As a player learns new things or some of their skills increase, the schmating also increases. Also, changes to their mood or just a momentary spur of creative inspiration can arguably make the schmating fluctuate, not only game by game, but move by move.
 And also, obviously, if the whole population gets “better” at playing but one player doesn’t, that one player’s schmating will decrease.

But still, over any period of time those fluctuations should oscillate around a center, and without enough information it makes sense that the meaningful measure you want is that center of gravitation, because on average it gives you the most accurate prediction of a player’s winrate relative to the schmating distribution.

 A player can play uncharacteristically good moves in a game (here by “good” we mean in terms of a hypothetical system that can calculate, say, the points lost or gained for any move), but on average those will be balanced out by some uncharacteristically bad moves.
 And even if they don’t within a game, the games where they play more good moves will on average balance out with the games where they play more bad moves, in the long term.

Since you can’t expect to predict those fluctuations, the measure you really want is the “center of gravitation” around which the schmating moves: that should give you the best estimate of the expected winrate relative to the schmating distribution (or the rating distribution).

Also, reasonably speaking, you can’t perform enough measurements to be sure of the schmating, and which other players a player is matched against heavily affects how much information you have about it. So if you rely too much on the apparent information given by a single game result, you will easily end up making the rating fluctuate wildly, while the schmating probably only fluctuates around a very smooth and stable line over time (except when the player improves suddenly, say by learning a new joseki, which will likely make their schmating ramp up very fast).

So you develop a statistical model that tries to be flexible and take all of these uncertainties into account, but ideally, the objective of the system will be to guess the current “center of gravitation” as stably as possible, or just generally to produce the best estimate, on average, of the expected outcome of any game based on knowledge of the rating only.

 Then of course other people, such as Mr. Schmlickman, can try to build different measuring models, and then by performing experiments you could find out which one is the best, although the model needs to be suitable for that (so there needs to be a meaningful concept of “expected probability of winning” that you can calculate from the rating).

 So in conclusion:

 I don’t think it really matters that the “true nature of the Go strength of a human” is a complicated multidimensional monster: in the end, it is possible to compress some of that information in a one-dimensional spectrum in a meaningful way that can also be measured,

and in the end, I believe that the rating being stable and less susceptible to fluctuations makes for better winrate predictions, which is how you get better matchmaking – although this is a hypothesis that needs to and can be tested.


Or to compress that long thing into a few words: rating doesn’t have the conceit to measure “strength”; it bypasses that by measuring the probability of a player winning against another player, based on a simplified measuring system and statistics.

Any philosophical reflection about strength being undefinable is beside the practical point: you can modify or tweak a rating system to make it more or less accurate in a statistical sense.

I believe in the end stability of rating is desirable in that sense. We can perform experiments to test that idea.

Yes, this is a good way of thinking about it! A couple of notes:

  • The axes of a latent space are usually not the human-understandable features. Instead of using a function providing a value to each point in the feature space and looking at volumes as you mentioned, people will often instead map from the feature space into a latent space in which points are already clustered or arranged by the characteristic you’re interested in. For example, you’ve probably noted that someone with very low joseki knowledge but almost supernatural reading might be very far (in terms of Euclidean distance) from someone of the same “schmating” who has good joseki knowledge or whatever other features are involved. In order to resolve this, a mapping is learned which transforms the feature space into a latent space where these two points are automatically close to each other (but now the axes are no longer human-understandable). A toy sketch of such a learned mapping follows this list.

  • A model based solely on statistical properties of the known win rates (even if it’s trying to approximate something like the “schmating”) will still run into the problems mentioned here (many of which you did make note of).
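Here’s the promised toy version of the first bullet. I’m assuming, purely for illustration, that the “schmating” is the product of two features, so two players with wildly different feature vectors can share the exact same value; a small network is trained to predict it, and its one-unit final hidden layer becomes the learned latent coordinate:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy feature space: (reading, joseki knowledge), both in [0, 1].
# Assumption: the "schmating" is the *product* of the features, so very
# different feature vectors can have an identical value.
X = rng.uniform(0, 1, size=(3000, 2))
y = X[:, 0] * X[:, 1]

# The final hidden layer has a single unit: that unit is the learned
# one-dimensional latent coordinate (and it is not human-interpretable).
net = MLPRegressor(hidden_layer_sizes=(32, 1), max_iter=5000, random_state=0)
net.fit(X, y)

def latent(x):
    # Manual forward pass (relu activations) up to the 1-D hidden layer.
    h = np.maximum(0, x @ net.coefs_[0] + net.intercepts_[0])
    return np.maximum(0, h @ net.coefs_[1] + net.intercepts_[1])

# Far apart in feature space, identical "schmating" (0.9*0.2 == 0.2*0.9):
a = np.array([[0.9, 0.2]])  # supernatural reading, little joseki
b = np.array([[0.2, 0.9]])  # modest reading, deep joseki knowledge
print("feature-space distance:", np.linalg.norm(a - b))   # large (~0.99)
print("latent-space distance: ", abs(latent(a) - latent(b)).item())
```

If the training goes well, the latent distance comes out far smaller than the feature distance, which is the “automatically close to each other” property described above.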

I do see what you’re saying, but please try to realize I am not ignoring or undervaluing your ideas; they might very well end up performing better than the current system if executed carefully. I am just trying to expand upon them, and say that any ranking system, even (actually especially) one based on observed probabilities that perfectly captures expected win rates at a given instant in time, will be subject to a good degree of volatility if you want it to maintain whatever “nice” statistical properties it has at that instant. If you are suggesting that one might favor stability of a ranking at the expense of accuracy for a given prediction, then I agree fully. In general though, it is important not to mask these problems behind added complexity – often, the problem will pop up again in a more subtle, unexpected manner.

We’re probably talking at cross purposes here. Is it possible to make a rating system with lower volatility and higher predictive power than the current one? Most likely yes! I am just stating that the closer such a model gets to describing the win rate of past games perfectly, the more volatility it is likely to have when updated to account for new games.

Another interesting thing to keep in mind is that the “optics” are also important.

 I know of at least one person who doesn’t want to play on OGS because of their perception that the rank volatility makes playing here worse than playing on other servers with a more stable rank.

Because I agree with thinking of Go strength in very fuzzy multidimensional terms, I personally think that perception doesn’t make much sense, but how many players is OGS losing because of this reputation?

Another thing, perhaps more important, is the way many players get attached to their rank and would like it to be a part of their identity. They are forced to go through the emotional roller coaster of seeing their rank rise and fall, never knowing whether a rise represents a “true improvement” (meaning a long-term improvement), and feeling frustrated whenever their rank falls, even though a fall usually doesn’t signify a long-term, or “meaningful”, decrease in their abilities.

 And just generally, as we’ve talked about and as I said in the first post in this thread, there’s a cultural perception that the rank should be pretty stable, and a lot of people will just assume that OGS’s rank being volatile means it’s bad, whether or not that’s the case.

 For these reasons, I believe that even if it did turn out that the rating volatility makes for better matchmaking, it would be much better in terms of user experience to “separate” the rank from the rating, or in other words to keep using the rating for probability estimation purposes, but have a rank that, while calculated starting from the rating, is kept artificially stable so that people can use it as a “badge” and feel that it more accurately represents their “average strength” (and that of the potential opponents).

 And maybe allow people the option to use the stable rank for the matchmaking, because in the end as long as there aren’t systematic biases, changing up the matchmaking shouldn’t affect the rating system significantly.

 As long as there is no “feedback loop”, where the rank affects the rating, I don’t really see how this could cause problems.

 But in the end, if my (and most people’s) hypothesis is right, we could actually kill two birds with one stone: it might turn out that stabilizing ranks would make for better matchmaking and make everyone happy.
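To make the “separation” concrete: the displayed rank could be something as dumb as an exponential moving average of the volatile rating, computed one-way so that nothing feeds back into the rating itself (a sketch of one possible stabilization, definitely not a proposal for the exact OGS formula):

```python
def stable_rank(rating_history, alpha=0.1):
    """One-way smoothing of a volatile rating into a displayable rank.

    alpha controls the tradeoff: smaller = more stable, but slower to
    follow genuine long-term improvement. The underlying rating is never
    touched, so there is no feedback loop into the rating system.
    """
    smoothed, current = [], rating_history[0]
    for r in rating_history:
        current += alpha * (r - current)
        smoothed.append(current)
    return smoothed

# Toy history: noise around 1500, then a genuine jump to ~1600.
history = [1500, 1540, 1470, 1525, 1480, 1600, 1610, 1595, 1605, 1600]
print([round(r) for r in stable_rank(history)])
```

With a small alpha the displayed “badge” barely moves during the noisy stretch, then drifts up once the improvement is sustained.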
 


 
---

So from looking around, this is the situation I see (edit: meaning this is the impression I’ve got so far, and I’m aware it might be wrong): most people believe the volatility of OGS’s rank is bad, and for the most part, the only people who say it’s not bad are the people who might have the power to make it go away (the people who have a good understanding of how the OGS code works).

 Without evidence, it’s hard to say who’s right. The people who have the power to put together that evidence are the same people who disagree with most everyone else.

 Whenever people say “we believe the volatility is bad”, the people “with the power” seem to reply “but we believe it’s good”, but don’t provide evidence.

I fully agree with everything you say above the horizontal bar.

I don’t believe volatility is inherently good or bad; it just depends on what you want your ranking system to achieve. Maybe that fact is what’s leading to this impression:

(in general it seems you have found that users want a stable rank, developers lean toward a rank that frequently updates to describe new data)

One of the main points I’m trying to get across is as follows:

Based on past posts, I have come to the interpretation that you want to test whether a ranking system which has lower volatility can produce better predictions about game results. This is a great goal, and probably achievable within reason.

However, past posts here and in other threads have led me to believe that you will judge such a model by how well it describes outcomes on some set of OGS games. I am just saying that, for any number of reasons which I have detailed before (elusiveness of a stable “true” rank, the need to update as new players are added or new matches occur, and the bias-variance tradeoff), one should be careful judging such a model/ranking system on how well it performs on existing data – you will likely end up increasing volatility if you push this accuracy too far, which I believe is the opposite of what you want.

Edit: found the root cause of this disagreement – it is the same as the one where you did not like this comment:

This is not a slippery slope or overly extreme hypothetical, it is simply how models like this work. There is of course a balance (and some trivial counterexamples that don’t apply to most real-world scenarios), but in general, you cannot have a model that is 100% accurate without it being more volatile than a model which is less accurate but also less volatile.


Well, more than anything else, if it turns out that a very simple stabilization of the rating produces better predictions on past data than the current algorithm does, then to me that sounds like extremely strong evidence that the volatility is not helping the accuracy at all.

It doesn’t sound to me at all like that would be overfitting the past data.

– but you’re right that this is a point I hadn’t considered enough in the past. For example, if you started exploring the parameter space to tweak the rating system and measure its performance on past data, I can see how that might lead to overfitting. Intuitively, it might be related to this.

I wonder to what extent that could be solved the same way it sometimes is in machine learning: you train the fit on one subset of the data, and you test it on another.


Still, I’d like to see some evidence. One can appeal to whatever big theories they want, but until they perform an experiment that could falsify their predictions they’re mostly empty words.

My words – my belief that the current level of volatility is not useful – are empty too, and I’d like to perform a test able to falsify them.

I would of course also be happy to just see (convincing) evidence gathered by someone else. “Hey, we did perform a test like what you were talking about, here’s the results” – my impression is that the closest thing to this that has been done so far is the “lumping” winrate test, and discussing how that isn’t necessarily significant is how this thread has come about.

I’m not terribly worried about overfitting. I have only a little ML knowledge, but as I understand it the biggest sources of overfitting are too-high dimensionality and too little data.

In this case, there is only one dimension we’re looking to tune (glicko volatility) and plenty of data (thousands of users/millions of games).

I guess the hard part is getting and processing the data. I know folks have figured out data dumps in the past, but I’ve never tried to do an analysis myself.

It tends to be caused by this, but the nature of the bias-variance tradeoff is that it’s a consideration at any dimensionality at all. For example, adding a quadratic term to what was previously a linear best-fit analysis creates a higher-variance scenario: the quadratic model may overfit the data that has been explored, whereas a linear regression might fail to capture the true relationship on either the training data or the test data, in which case it would be “biased”.

One way to test a comparison of models for overfitting is to divide the data into “training data” and “test data” (in rating systems this is often done by having the test data come later in time, due to their time-dependent nature): you tune all the parameters on the training data, and then evaluate the models on the test data.
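To make the bias-variance point above concrete, here’s the classic toy demonstration: higher-degree fits always do at least as well on the training data, but past some point they do worse on the held-out test data. (Nothing here is specific to ratings; for a rating system the split would be by time, training on earlier games and testing on later ones.)

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 0.3, n)  # the "true" relationship is linear
    return x, y

x_train, y_train = make_data(12)   # small training set, as in real tuning
x_test, y_test = make_data(200)    # held-out data the fit never saw

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # tune on training data only
    for name, x, y in (("train", x_train, y_train), ("test", x_test, y_test)):
        mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(f"degree {degree}, {name} MSE: {mse:.3f}")
```

The degree-9 fit nearly interpolates the 12 training points (tiny training error) while its test error blows up; that is the variance half of the tradeoff, and it’s exactly the failure mode to watch for when tuning rating-system parameters against past games.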

Now, I’ve been told that, due to the matchmaking system’s reliance on the model, not incorporating this can still miss some of the overfitting, but that’s a whole other topic.

Now, I might ask: what exactly are you looking to tune in terms of the volatility? The initial parameter? The tau constant? Its time-based relationship with the ratings deviation?

From what I can tell, the OP wants to test the subjective volatility and is considering creating a new system on top of the existing Glicko model (thereby creating a new model). As such, you’d need to tune all the Glicko parameters to the training data in both the unmodified form and the modified forms in order to make a proper comparison, plus any parameters added on top (such as any coefficients used in an averaging of previous ratings to get an idea of the current rating).


We could:

  1. Come up with some of those parameters, “reading”, “fuseki”, “direction of play”, “estimating the value of moves”, “estimating the score”, “making good shape”, “invasion”, “reduction”, and so on. Some parameters may be redundant/overlapping, that’s ok.

  2. With the help of a professional player, come up with a set of problems meant to evaluate all those parameters.

  3. Put all these problems in a form, along with the questions “What is your name and which national go association are you a part of?” and “On which internet server do you play regularly, and what is your name on that server?”

  4. Use the problems to position each participant in the feature space, and use the records of the national associations and internet servers to find out who wins against whom.


Basically. In fact, funnily enough, Glicko-2 volatility is one parameter I don’t necessarily care about, since if I understand correctly it’s implemented in a bit of a weird way in the current system – it does seem marginally useful, but not too much. I feel that we could ignore it and the system would work almost entirely the same way.

 I don’t know how to explain this, but basically: people don’t like the perceived volatility, and it might be bad, and if that’s true we want to “fix” it.

Then, I’m not sure what exactly the role of volatility is in the real Glicko-2 model, but AFAIU, in the current OGS rating system it’s exactly the opposite of what we want: the “volatility” parameter (or the parameter substituting Glicko volatility) measures how “spread out” the recent rating has been, and if it’s really spread out, it makes it even more volatile by increasing the deviation.
 (I think based on the reasoning that a high “spread-out-ness” means either that the player is quickly getting better or worse at the game, or that the rating is otherwise very inaccurate and needs to be adjusted quickly.)



On a related note, I have yet to finish reading the Glicko papers, but the impression I got so far is that the model isn't very useful for the use case of an online platform such as OGS, unless implemented in some specific way.

My understanding is that the Glicko models were built to fit the situation where:

  • players don’t play many games
  • they take the games they play quite seriously (focusing on winning and not on learning or teaching, for example)
  • the games are played in clumped-together groups but otherwise somewhat spread out in time (such as real life tournaments).

The second issue (the assumption that players focus on winning more than having fun) is, I believe, unsolvable in practice (not just when trying to adapt Glicko to OGS, but for any rating system), since this is an amateur platform where people come to have fun (which includes, but is very much not limited to, competing and testing their skills). So we’ll probably just have to accept that the rating is going to suffer from that – there might be weird ways to parametrize that too, but as we’re discussing, the simpler the better.

By the way, in recent threads people seem to keep telling me this is not true, but nobody has brought any convincing argument in support of that, or against mine: in a situation where players focus on doing their best to win, and doing that the best way they know how, the “strength” of their performance should be much more stable than in a situation where they’re just having fun or exploring new possibilities.
 Any exotic argument such as “you might inadvertently stumble into better moves when you’re playing in a relaxed way” is completely irrelevant, because most of the time that is not the case. If you keep your playing style stable, you’re expected to have a more stable performance. So changing your playing style when playing against weaker players, for example, is going to give the rating system a harder time.

The clumped-together aspect might be adaptable to OGS by only updating the rating, say, once a day, using a rolling window system with an unfixed window size (equal to the number of games that day). If you want to satisfy the player’s curiosity to see an immediate improvement, you could have a separated “temporary” rating that updates game by game, but without affecting the “true” daily rating update.
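Roughly, I imagine the bookkeeping like this, with a hypothetical `glicko_update` standing in for the real (much more involved) OGS update code; the provisional number is display-only and never feeds back:

```python
from collections import defaultdict
from datetime import date

def glicko_update(rating, results):
    """Stand-in for a real Glicko-style batch update: a whole rating
    period's results move the rating at once. Grossly simplified."""
    return rating + sum(32 * (score - 0.5) for _opp, score in results)

# games: (day, player, opponent_rating, score), score 1 = win, 0 = loss.
games = [
    (date(2023, 1, 5), "alice", 1480, 1),
    (date(2023, 1, 5), "alice", 1550, 0),
    (date(2023, 1, 6), "alice", 1500, 1),
]

official = {"alice": 1500.0}  # updated once per day, window = that day's games
provisional = dict(official)  # updated game by game, cosmetic only

by_day = defaultdict(list)
for day, player, opp, score in games:
    by_day[(day, player)].append((opp, score))
    # Instant gratification for the player, but never used as input
    # to the next official daily update:
    provisional[player] = glicko_update(provisional[player], [(opp, score)])

for (day, player), results in sorted(by_day.items()):
    official[player] = glicko_update(official[player], results)
    print(day, player, "official:", official[player],
          "provisional:", provisional[player])
```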

I think this assessment, while arguably being the case (although in reality it doesn’t account for how much your rating has moved, since it doesn’t move during a ratings period), implies a positive feedback loop where one does not clearly exist (we do have volatilities that settle at a reasonable level).

Only the last is really important for the model (as that’s what makes ratings periods easy), and this is why I don’t complain too much about instant-time ratings periods on OGS even if they are higher variance.

EDIT: I suppose I might as well justify the “taking games seriously” point. This actually goes back to the original Elo assumption that your performance during a game is already a distribution that includes taking it more seriously than not, although one could argue that different players have different levels of variance on this scale, and accounting for that might suggest the need for another modification to the E function.

I’m curious what advantage exactly this would have over a simple daily update period, as it runs the risk of a problem similar to the “15 ratings” problem of the previous rolling-window solution.