Testing the Volatility: Summary

espoojaram · January 21, 2023, 10:53am

I think we need to discuss the consequences of this, in terms of testing purposes.

So far, I’ve been assuming the obvious way to perform such a test is something like the idea I came up with for a “virtual timeline”, which is just doing the same calculations that “would happen” if the rating system was being used in real life and the same sequence of matches happened.

(Here's the original full explanation of the idea, just for reference, but you don't need to read it)

Is rating volatility a bug or a feature? [Forked]

Here’s my reasoning: what are the effects that a rating system has on the future games on the site? I’d argue that the main effects are:

1. it affects the matchmaking, though I’d argue all semi-decent rating systems affect the matchmaking in a very similar way, so the effects of this might be limited;

2. it definitely affects the handicap predicted;

3. it (slightly?) affects the performance of players because of the psychological effect of seeing the rank of the opponent, and being affected by being on an upswing or a downswing of the player’s own rank, and having a preconception about what one’s own rank “should” be.

At least for the moment, I don’t think we should care too much about point 3, even though I’m not sure it’s irrelevant. I’m not sure point 1 is very relevant when it comes to testing the winrate predictions, but I should probably describe the methodology I’m thinking of, before we could discuss that.

Here’s the methodology:

Collect as many past games as you can process. For example, start from a random player, collect all of their ranked games, start going through their opponents and collect all of their ranked games, iteratively. At some point you stop and you remove from the sample all of the games where one of the players is a player that you hadn’t selected.

Hopefully now you have enough games to accurately sample the skill progression of most of the players you selected.

Sort the games in chronological order.

Now essentially simulate a virtual timeline and just run the rating system of your choice on your players (Maybe start each of them with a provisional rating equal to the OGS rating they had in the game you start with? We could try both ways out of curiosity). Obviously, at any point in the virtual timeline, only ever feed to the rating system data points from the past in the virtual timeline.

And, uh… see what happens.
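The steps above can be sketched roughly as follows. This is only a minimal sketch: it uses a plain Elo update as a stand-in for "the rating system of your choice" (OGS itself does not use plain Elo), and the `Game` record, the starting rating of 1500 and the K-factor of 32 are assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    timestamp: int   # any sortable time value (hypothetical field)
    black: str
    white: str
    black_won: bool


def expected_score(r_a: float, r_b: float) -> float:
    """Elo-style win probability of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def run_virtual_timeline(games, initial_rating=1500.0, k=32.0):
    """Replay games chronologically; return (final ratings, predictions).

    Each prediction is recorded before the game's result is used, so the
    system only ever sees data from the past of the virtual timeline.
    """
    ratings = {}
    predictions = []  # (predicted P(black wins), black_won)
    for g in sorted(games, key=lambda game: game.timestamp):
        r_black = ratings.get(g.black, initial_rating)
        r_white = ratings.get(g.white, initial_rating)
        p_black = expected_score(r_black, r_white)
        predictions.append((p_black, g.black_won))
        outcome = 1.0 if g.black_won else 0.0
        # Symmetric Elo update: what black gains, white loses.
        ratings[g.black] = r_black + k * (outcome - p_black)
        ratings[g.white] = r_white - k * (outcome - p_black)
    return ratings, predictions
```

The list of `(prediction, result)` pairs is what any evaluation metric would then be computed on; swapping in a different rating system only changes the update step.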

I’d argue that with the exception of points 1, 2 and 3 above, this is pretty much exactly the same as if you had actually used this rating system in the real world. In fact, as a sanity check and out of curiosity, you could also run the very rating system that was actually in place and see if running it on a subset makes a difference. This could also be a good test of its solidity.

So does point 1, the matchmaking, matter? I don’t know. On the one hand, matchmaking is already imprecise in the real system by necessity, so it’s imprecise in the real one and it’s imprecise in the virtual one. If the general trend of the rating is the same, I think the difference shouldn’t be too much. Also, many games on OGS are not even automatched, and that definitely makes matchmaking slightly less of a factor.

But I would expect the average rating difference between players to be larger in the virtual system than in the real one. And in that case, the virtual rating system is expected to have stronger predictions than a real one. I genuinely don’t know if this makes for a better or a worse test of the virtual rating’s solidity though

Alright, so point 2, the handicap. In a way, I think this is only slightly worse than the matchmaking, as I imagine the main effect being that the simulated system would less often see handicap games where it thinks the handicap is “right”. But in handicap games, the performances of the players are often affected by the number of handicap stones, so this is definitely more of a problem.

So I’d say a virtual simulation should be expected to be less reliable when it comes to handicap games, although again I can’t wrap my head around whether this means it’s a better test or a worse test. Science is hard.

So, because of the way matches are chosen, most of them are between players with similar rating (using what I’m going to call the “real rating system”), which means most of the predictions in the real rating system are close to 50%; whereas in the “virtual rating system” that’s not necessarily the case.

In my original proposal I anticipated this concept and made these considerations:

Is rating volatility a bug or a feature? [Forked]

So does [the matchmaking] matter? I don’t know. On the one hand, matchmaking is already imprecise in the real system by necessity, so it’s imprecise in the real one and it’s imprecise in the virtual one. If the general trend of the rating is the same, I think the difference shouldn’t be too much. Also, many games on OGS are not even automatched, and that definitely makes matchmaking slightly less of a factor.

But I would expect the average rating difference between players to be larger in the virtual system than in the real one. And in that case, the virtual rating system is expected to have stronger predictions than a real one. I genuinely don’t know if this makes for a better or a worse test of the virtual rating’s solidity though

So, we’ve now discussed the idea that most evaluation metrics, if not all, seem to punish wrong predictions close to 50% less than wrong predictions far from 50%.

I’m trying to think about this, but I’m unable to reach a conclusion.

Do you think this gives an edge to the real rating system, because its wrong predictions, closer to 50%, are punished less? Or do you think this gives an edge to the virtual rating system, because “it’s easier to correctly predict the result if the players have very different ratings”?

Or does it not matter and whichever system gets the “edge” is just the one making better predictions?

There’s a little intuitive voice in my head saying that the answer actually depends on whether you’re using a metric that evaluates games individually, like binomial deviance, or a metric that evaluates based on percentage of correct predictions, like the WHR or the one I had come up with:

But I have no idea how to reason thoroughly about this.

joooom · January 21, 2023, 3:28pm

I’m very glad you brought this up, because this is another thing I’ve been contemplating and I similarly don’t quite know.

My suggestion at the moment would be to evaluate systems with metrics of all kinds and see what happens in practice.

Worth noting (if it hasn’t been noted already, my bad if I missed a post somewhere) that the current analysis code does this to some degree.

It shows the percentage of correct predictions, stratified into rank bands:

It also shows an average log-loss “cost” of the predictions (for each game, $-\left(y \log p + (1 - y) \log(1 - p)\right)$ is calculated, where p is the ranking system’s predicted win probability for black, and y is 1 if black wins and 0 if white wins):

However, as you mention, this is on a per-game basis and averaged across all games, so “incorrect” 50/50 predictions aren’t punished as much as they could be. A system that guesses 50/50 every time will score 0.693 by this metric – very similar to the results in my small run shown above, which makes sense given the rank stratification (maybe contradicting the “lower is better” note in this case, at least to some degree).
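To make the 0.693 figure concrete, here is a small sketch of the per-game log-loss and its average. The function names are my own and not taken from the analysis code; the natural logarithm is assumed, which is what gives ln 2 ≈ 0.693 for a constant 50/50 guess.

```python
import math


def log_loss(p: float, y: int) -> float:
    """Per-game cost: p is the predicted P(black wins), y is 1 if black won."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_log_loss(predictions):
    """Average cost over a list of (p, y) pairs."""
    return sum(log_loss(p, y) for p, y in predictions) / len(predictions)


# A system that always says 50/50 scores ln 2 regardless of the results.
coin_flip = [(0.5, 1), (0.5, 0), (0.5, 1)]
print(round(mean_log_loss(coin_flip), 3))  # 0.693
```

A wrong prediction costs more the further it was from 50%: `log_loss(0.9, 0)` is larger than `log_loss(0.6, 0)`.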

The fact that the database likely contains a lot of games between players of similar skill level also probably contributes to this observation. It might be interesting to see what happens when the metrics aren’t grouped by rank bands, or are grouped differently to test rank separation. For example, one could evaluate a similar metric on games in which the ranking system thinks the players are in the same integral rank (expecting roughly 50% predictive accuracy), and then again on subsets of the dataset involving players that the system thinks are of significantly different ranks. Your suggestion about grouping games by predicted win rate is similar and could also be worthwhile.
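One possible reading of "grouping games by predicted win rate" is a calibration table: lump games by how strongly the system favoured one side, then compare the average prediction in each lump with how often the favourite actually won. The sketch below is an assumption about how that could be done, not the metric proposed earlier in the thread; the bucket count and names are arbitrary.

```python
from collections import defaultdict


def calibration_table(predictions, n_buckets=5):
    """Group (p, y) pairs by the predicted win rate of the favourite.

    Returns {bucket lower bound: (games, mean predicted, observed rate)}.
    """
    buckets = defaultdict(list)
    width = 0.5 / n_buckets
    for p, y in predictions:
        # Re-express each game from the favourite's point of view.
        if p < 0.5:
            p, y = 1.0 - p, 1 - y
        index = min(int((p - 0.5) / width), n_buckets - 1)
        buckets[index].append((p, y))
    table = {}
    for index in sorted(buckets):
        rows = buckets[index]
        lower = round(0.5 + index * width, 3)
        mean_p = sum(p for p, _ in rows) / len(rows)
        observed = sum(y for _, y in rows) / len(rows)
        table[lower] = (len(rows), mean_p, observed)
    return table
```

In a well-calibrated system the last two numbers of each row should be close, given enough games in the row.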

Then again, it’s unclear whether any of these ideas are any better of a metric for evaluating a rating system, and as @flovo mentions, they might be problematic when there just isn’t enough data for certain players over a given time period. Again, probably worth trying all of these out just to see.

joooom · January 21, 2023, 4:01pm

On that note, we might make similar tables but where each row is stratified by rank difference (e.g. rows could have rank differences of [0, 1), [1, 2), [2, 3), etc). In a good system, you’d expect predictive accuracy to increase up from 50% across these rows. I understand that the original tables were more concerned with maintaining roughly equal win rates across different handicaps, so this might be a useful addition for comparing the old system to modified ones.

Edit: if we keep the handicap columns, you’d expect the 50% to continue along the diagonal in such a table
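A rough sketch of such a table, with rows keyed by rank difference and columns by handicap. The input layout (numeric ranks where larger means stronger, plus a handicap count) is an assumption for the example and not the format used by the analysis scripts.

```python
import math
from collections import defaultdict


def higher_rank_win_rates(games):
    """games: iterable of (rank_black, rank_white, handicap, black_won).

    Returns {(floor of rank difference, handicap): (games, win rate of
    the higher-ranked player)}. On an exact tie white is treated as the
    higher-ranked player, which only affects the [0, 1) row.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [games, higher-rank wins]
    for rank_black, rank_white, handicap, black_won in games:
        diff = abs(rank_black - rank_white)
        higher_is_black = rank_black > rank_white
        higher_won = black_won == higher_is_black
        cell = counts[(math.floor(diff), handicap)]
        cell[0] += 1
        cell[1] += int(higher_won)
    return {key: (n, wins / n) for key, (n, wins) in counts.items()}
```

With handicap 0 the win rate should climb from about 50% as the rank difference grows, and cells where the handicap matches the rank difference should sit near 50%.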

espoojaram · January 21, 2023, 4:24pm

Funnily enough, I was so skeptical of the current level of volatility that I had actually speculated this might happen – that a system just predicting 50% all the time wouldn’t necessarily “perform” worse than the current one, at least in games between players who are reasonably close in rating

On a related note, I have to admit that when you said

I was surprised, because I had actually been thinking kind of the opposite – intuitively it’s pretty easy for any decent rating system to predict that a much stronger player is going to win, and volatility is somewhat “bounded” because, for example, if a “truly 6 kyu” player is misranked by the system as 3 kyu, they will start losing most games and the rating will naturally be brought closer to the “true rating”.

So even a volatile system can avoid “egregious” mistakes, and if you ask it “who do you expect to win?” it will almost always answer correctly for games between mismatched players.

AFAIU this is exactly what the WHR evaluation metric did, so I’m thinking it might actually be a very good metric to evaluate games between closely matched players (perhaps a “lumping” metric like the “naive” one I proposed might be even better?)

But now I’m starting to realize that what you meant was probably that the devil is in the exact winrate predicted.

If a rating system can more accurately predict the exact winrate between mismatched players, it should mean it’s matching the probabilistic model underlying the rating more accurately, and thus that the mismatched players were more accurately rated.

Anyway, I agree that we should just use all the decent metrics we can think of and hope that they all come to the same conclusion, I was thinking that from the beginning

Oh, and to come back to the question at hand (whether a metric “gives an edge” to the real rating system or to the virtual one), just to be clear, of course ideally we would verify that the metric doesn’t give an edge to either one;

but if that isn’t the case, the second-best scenario would be if the metric gave an edge to the real system, for scientific reasons: in our test we’re trying to “prove” that there is a better system, so if the virtual system outperforms the real one even with a metric that favors the real one, then it’s even stronger evidence.

joooom · January 21, 2023, 5:17pm

It’s important to note that this is occurring in the tables I posted above precisely because the ranking system is doing a good job at ranking people of similar skill closely, so the caveat “at least in games between players who are reasonably close in rating” is key. A system that predicts 50% all the time for everyone would be significantly worse, and obviously wouldn’t assign rankings correctly at all. Sure it would perform the same within the bands of the table above as you mentioned, but the only reason we have that table stratified at all is because of the current ranking system.

You did note this, but I want to restate it in these terms to point out why I’m suggesting we also look at predictions between ranks – because it will give a better picture of how modifying the volatility affects the system’s ability to assign predictive ranks as a whole.

This idea is probably summed up best in response to this statement:

I’m saying that it’s not necessarily even a volatile system that can avoid these mistakes, it might be especially a volatile system that can avoid these mistakes. In order to see how a less volatile system performs, we need to evaluate it on both games it thinks are evenly matched and games it thinks aren’t. Any given system will have a different idea of which games are mismatched – “mismatched” is in the eye of the system assigning rankings itself. It could happen that a less volatile system still does a pretty good job identifying people of similar skill, but is slightly less predictive when considering people who differ greatly in the ranks it assigns (which is relevant for handicaps). (It could also be more predictive, or not significantly different)

Right, and that’s part of the advantage of having some degree of volatility. I’m just saying that it’s unclear how much reducing volatility will affect the system’s ability to perform that kind of correction as necessary. It all comes down to what types of corrections are necessary and how often they occur. I still fully acknowledge that a less volatile system might be better, just want to add yet another way of comparing.

Exactly, as this is directly related to the ranks the system assigns.

espoojaram · January 21, 2023, 5:52pm

So, do you think separating matches into buckets depending on what we’ve been discussing might be useful? Specifically:

Would it be useful to subdivide matches into “matches between closely matched players” and “matches between mismatched players” given one rating system being tested?
Would it be useful to instead subdivide games into “games about which the real and virtual ratings systems have significantly different win% predictions” and “games where they have similar predictions”?
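Both ways of splitting could be sketched like this. The thresholds are arbitrary placeholders, and the tuple layouts are assumptions made for the illustration.

```python
def split_by_closeness(predictions, threshold=0.1):
    """predictions: iterable of (p, black_won) from ONE rating system.

    Returns (close, mismatched): games the system itself considers
    roughly even versus games where it clearly favours one side.
    """
    close, mismatched = [], []
    for p, black_won in predictions:
        target = close if abs(p - 0.5) <= threshold else mismatched
        target.append((p, black_won))
    return close, mismatched


def split_by_disagreement(games, threshold=0.1):
    """games: iterable of (p_real, p_virtual, black_won).

    Returns (similar, different): games where the two systems' win%
    predictions are within `threshold` of each other, and the rest.
    """
    similar, different = [], []
    for p_real, p_virtual, black_won in games:
        target = different if abs(p_real - p_virtual) > threshold else similar
        target.append((p_real, p_virtual, black_won))
    return similar, different
```

Any of the metrics discussed so far could then be computed separately on each bucket.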

I realized this is pretty much an upgraded reformulation of the idea I had proposed here:

Which reminds me, I have to update the first post in this thread with a summary of what we’ve said so far!

I know it’s a lot of work, but since I believe I only have a superficial understanding of the arguments that have been brought so far, could I ask you to kindly rewrite the summary you wrote here, but in the form of a bullet point list? That would help me a lot

(Well, it would also be ideal to express it as much as possible in terms that a layman could understand, but I definitely won’t blame you if you can’t)

meili_yinhua · January 21, 2023, 8:30pm

I would consider using a number of buckets, but not necessarily as a “metric of performance” but as a sort of “looking for correlations that could lead to a better system”

like we might find that certain players (maybe correlated around a rank) might have a high subjective volatility due to having a wider spread of “performance abilities” centered around a “true rating”, leading to a higher overall variance in results (My experience makes me extremely suspicious of the assumption that any two players with the same “true rating” have the same chance to win against any number of players of different ratings, despite being roughly similar, even if this assumption makes elo/glicko math simpler)

Like, as I see it, at the end of the day, there’s one metric, or potentially “pseudometric” made from a combination of multiple and human intuition, that we make decisions on. If certain buckets are “interest killers” or even a major part of our “pseudometric”, then absolutely test them, the rest are “primary research data”

espoojaram · January 22, 2023, 12:27pm

Hm, you know what, there’s a subtlety here. I believe the phenomenon I described is an advantage that comes from a volatile matchmaking more so than a volatile rating.

Some amount of mismatching gives (or is expected to give) the system more info than if the players were always matched perfectly (in terms of “schmating” – ok, this is getting ridiculous, is there really not a pre-existing technical term for that? I don’t know, “percentile rating” might be good?)

I suspect a less-reactive rating would be able to get about as much “information” (in signal-to-noise terms) from those mismatches as a volatile one.
One drawback is that it would be slower to react in case the rating estimate was very wrong, but considering what our objective is and the kind of scenarios we’re trying to prevent (i.e. most users on OGS have a fairly stable center of undulation, and we basically want to “bring that to the surface”), this might not be a problem.

(Also, intuitively, I think multiple-game rating periods would be able to somewhat circumvent this problem in a different way, as I talked about before (the analogy to waves). They might be able to both be unreactive to high-frequency noise and reasonably reactive to being significantly wrong.)

But this brings me back to the conundrum we were talking about: the purpose of the rating system is the matchmaking, so in practice a lot of matchmaking (both automatic and manual) is based on the rating system, which means that if the rating is less volatile, then the matchmaking, usually, will be too, which I expect will cause volatility for a different reason

I’m feeling that basically there’s a very significant trade-off between the accuracy of the matchmaking and the accuracy of the rating. If you want the rating to be more accurate, you have to sacrifice the accuracy of the matchmaking.

The other direction feels more complicated though; if you want to increase the accuracy of the matchmaking, you can’t get an accurate rating system, but if you don’t have an accurate rating system, the matchmaking will necessarily also be inaccurate.
So for maximizing the matchmaking accuracy, there’s probably a sweet spot in the trade-off.

But most relevantly for our purposes, if I’m right about this, then I think it means that the virtual rating system probably does get an edge over the real one during the testing (because the matchmaking following the real rating system causes a feedback that “confuses” the real rating system a little bit, at least in the case of instant ratings), which is the worst-case scenario

I believe me and @joooom talked about this at length when this interaction began (we talked about latent spaces, intransitivity and whatnot):

So yes, if you could have more information about the playing style of the players, having the same “true rating” (in this case defined as “percentile rating” – defined here) wouldn’t imply they have the same probability of winning against a third player, but since you don’t have access to that information, all the probabilities cancel out, giving you the same expected winrate if the percentile rating is the same;

and if you had enough data, you should expect real-world data to match that, as long as the “percentile rating” was correctly assigned to both players.

EDIT: Oh, actually, I just realized this reasoning only holds for two players who have the same percentile rating. It’s possible that this kind of reasoning might be generalized for different percentile ratings too, but the argument as it is now doesn’t work, because the symmetry I described doesn’t happen. Hmmm.

meili_yinhua · January 22, 2023, 7:56pm

I’m not even referring to intransitivity. I’m referring to “performance variance”.
Let’s divide players into two groups: solid players and experimental players. The solid players play what they know and don’t try anything beyond what they’re very confident they can’t be punished for, whereas experimental players frequently try out ideas that they’re not quite sure of, but that aren’t clearly bad.

Now take an experimental player and a solid player that have a 50% winrate and the same “true rating” with no other clear style difference or notable piece of intransitivity:

It seems to follow that the experimental player has a higher (albeit still less than 50%) chance of winning against a stronger player than the solid one, as the more often their ideas succeed, the better their relative performance is, whereas the solid player has a higher chance of winning games against weaker players, as they’re less likely to blunder and give the opponent chances

The idea being that the experimental player has a more variant performance, but the glicko/elo formulations assume exactly one “performance variance” across all players (which defines the elo/glicko scales)
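A toy model can illustrate the argument. Assume (this is my assumption, not something Elo or Glicko prescribes in this form) that each player's performance in a game is an independent normal draw around their mean skill, with a per-player standard deviation, and that the higher performance wins. All numbers below are invented.

```python
import math


def win_probability(mu_a, sigma_a, mu_b, sigma_b):
    """P(A outperforms B) when performances are independent normal draws."""
    spread = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / (spread * math.sqrt(2.0))))


# (mean skill, performance standard deviation), hypothetical values
solid = (0.0, 0.5)
experimental = (0.0, 1.5)
stronger = (1.0, 1.0)
weaker = (-1.0, 1.0)

print("solid vs experimental:", win_probability(*solid, *experimental))
print("solid vs stronger:", win_probability(*solid, *stronger))
print("experimental vs stronger:", win_probability(*experimental, *stronger))
print("solid vs weaker:", win_probability(*solid, *weaker))
print("experimental vs weaker:", win_probability(*experimental, *weaker))
```

Under these assumptions the two players are even against each other, the experimental player does better against the stronger opponent (while still below 50%), and the solid player does better against the weaker one.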

espoojaram · January 22, 2023, 8:24pm

Hmmm. Your reasoning is too abstract for me right now.

One thing that does come to mind, at the risk of doing the obnoxious physicist thing, is that performance variance might just be considered another parameter axis in the feature/latent space, and thus what you’re referring to might simply be modeled as a specific form of intransitivity – and if it’s possible to prove that intransitivity doesn’t affect the “transitivity of the win% over percentile rating strata”, then it would follow from these two things that performance variance also doesn’t?

But even then, I’d agree that assuming one constant performance variance across all players is not a justifiable assumption per se.

meili_yinhua · January 22, 2023, 8:33pm

Mkay, so I’ll describe why I feel like this “performance variance” matters a bit to the currently existing model and updating method: for players with more variable performance relative to their “true ratings” (be it from being “experimental” or based on “form”, assuming form doesn’t correlate the performance of nearby games without being a function of a “true rating” change, which would require a new layer in the model, or some other factor), each game should provide less indication of strength or weakness with which to move the “estimated rating” than it would for less performance-variable players, as upsets are more likely to happen. And if they’re assumed to have the same performance variation, they will have a much more “subjectively volatile” rating within a rating system than those other players

joooom · January 23, 2023, 2:51pm

Very quick and dirty summary

A model that is extremely accurate in terms of raw percentage of game outcomes it predicts correctly will probably be very volatile. This is more relevant to a model like WHR whose main goal is to chase predictive accuracy, and may not matter as much for us if we’re just minorly tweaking the current system.
The metric used to evaluate “predictive performance” is very important
It’s also worth considering how and when any specific system makes changes to a player’s rank. The more often (in the extreme case, modifying everyone’s ranks after a new game outcome is encountered), the more the above points matter
A number of sources (skill variability, good and bad days, player pools interacting) may introduce volatility that’s hard to avoid

Strongly agreed. Maybe that helps add some context to the earlier statement “the key to judging a good model (if the goal is to judge one by predictive accuracy) might be how well it stratifies players whose observed win rates are significantly different than 50%.” Maintaining 50% win rates within ranks is of course important, but in order to compare one already good system to a modified version, it’s worth checking if/how these other rates change and whether it’s possible to maintain anything relatively consistent. It could also be a useless metric subject to a ton of noise, but we’ll find out. It also might be that maintaining ~50% win rates for handicap games where the rank difference equals the handicap is a good enough proxy for these measurements.

On that note, I’ve started generating some other stats from the current ranking system (the exact numbers give a good idea of some trends, but are not final as I only evaluated a small subset of the dataset):

[Table: how often the player of higher rank wins, stratified by rank difference and handicap]

[Table: how often the stronger player wins (in games with no handicap)]

To summarize, I’m just interested in seeing how these “pseudometrics” (thanks @meili_yinhua) change when evaluated on the same data for a given proposed modification to the ranking system. If there are any others you want to try out let me know.

espoojaram · January 23, 2023, 3:26pm

Oh, is that what @meili_yinhua meant with “pseudometric”? I wasn’t really sure what they were talking about

By the way, @joooom, would you mind sharing the code you used to generate those tables?

joooom · January 23, 2023, 3:36pm

I’m not sure what they had in mind specifically, but I think the general sentiment was “take a bunch of factors and look at them holistically to judge any given ranking system.”

The code is just modified versions of the scripts in the analysis folder of the online-go/goratings repository on GitHub, which contains the (future) official rating and ranking system for online-go.com, as well as analysis code and data to develop that system and compare it to other reference systems.

espoojaram · January 23, 2023, 5:00pm

Well, unless you already had familiarity with it before doing this, this is quite discouraging for my prospects of learning my way around the code base. It looks mind-numbingly complicated to me

joooom · January 23, 2023, 6:31pm

I think part of that feeling is just the nature of looking at other people’s code in general, especially research-oriented code. I’m happy to try and provide clarifications (and I imagine anoek wouldn’t mind either – as the original author he might give more insight about why things were done the way they were).

square.defender · March 25, 2023, 9:08pm

There are bad days, and this is a fact. Real rank data is inevitably noisy. But an average may be useful.

Don’t make the rank system less precise or slower; just add the ability to plot an average rank graph.

An average is the simplest and easiest way to track your progress, no matter how chaotic the data.
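As a rough illustration of the averaged graph suggested here, a trailing moving average over a player's rating history might look like the sketch below. The function name and the window size are made up for the example; nothing here reflects how OGS actually draws its graphs.

```python
def moving_average(values, window=5):
    """Trailing average; early points average over the games seen so far."""
    averaged = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        averaged.append(running / min(i + 1, window))
    return averaged
```

The raw rating history stays untouched; the averaged series is only an extra line to plot on top of it.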