Is rating volatility a bug or a feature? [Forked]

A post was merged into an existing topic: Why is it that handicap games are so scarce on OGS?

I’ve always imagined we used it similarly to how we do rating updates with handicap: by creating a “virtual boost” to the rating of the side taking handicap, equal to whatever the stone difference is, and using that to calculate the Expected Score.
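
To make the “virtual boost” idea concrete, here is a minimal sketch. It assumes a plain Elo-style logistic curve instead of OGS’s actual Glicko-2 code, and `POINTS_PER_STONE` and both function names are hypothetical, picked only for illustration:

```python
# Hypothetical conversion: how many rating points one handicap stone is worth.
# The real value (and whether it is constant across ranks) is an assumption.
POINTS_PER_STONE = 100.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo-style probability that A beats B in an even game."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def expected_score_with_handicap(black: float, white: float, stones: int) -> float:
    """Expected score for Black, who receives `stones` handicap stones."""
    boosted_black = black + stones * POINTS_PER_STONE  # the "virtual boost"
    return expected_score(boosted_black, white)


# With these made-up numbers, 3 stones exactly cover a 300-point gap,
# so the boosted ratings are equal and the game is treated as even.
print(expected_score_with_handicap(1500, 1800, 3))
```

The only point is that the handicap enters before the Expected Score is computed, so the rest of the update can stay unchanged.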

3 Likes

As per usual disclaimer, I’m very far from knowing what I’m talking about here, but…
I have a feeling that if we start trying to use the tools implemented in the rating system to try to judge the rating system itself, we could easily end up doing some circular logic, and the rating system basically saying “no, I’m perfect, what are you on about?”

So if we want to devise a test on the rating system using those tools, I think we need to be very careful and sure that we’re not falling into this trap.

Anyway, I’d like to take a crack at the problem myself, but I have no idea if there’s a way to access the data about OGS games, short of continuously monitoring the public games in real time and downloading all of them.

(I’d really like it if there were some kind of initiative or place where people could start learning and understanding the OGS code so that they might eventually be able to contribute and help fix bugs, but I almost feel like OGS is resigned to basically dumping all of that burden on one single person for some reason)

1 Like

Well, a big part of the OGS code is on GitHub (online-go.com · GitHub), but I think anoek is the only one who can do anything to the rating system, so it’s probably best to check with him about any changes. He’s been pretty busy lately with all sorts of bugfixes, so I guess he would appreciate all the help possible ^^

1 Like

Yes, the code is there, but I feel that it has the same “problems” as most of the other open source projects I’ve seen so far: it’s not well documented and essentially anyone who’s not familiar with the workflow and the structure is going to have quite a hard time parsing it. If they don’t have experience working on a programming project that’s vaguely similar, I think it might as well not be open source for them.

At least it’s definitely true for me; I recently went there with the specific question “what is the exact algorithm and mathematical formula used to update ranks?” (because the famous thread announcing the rating system doesn’t explain it perfectly), and I couldn’t even find where that piece of code would be in the entire repository.

(Although I also don’t have much familiarity with GitHub itself, and generally my only experience with programming is that of a “self-taught” amateur who has only ever written basic scripts, usually Python, for private use, so that might be a big part of why it’s so difficult for me)

I’m not complaining, since I’m sure producing the documentation and making the code well-commented enough to be accessible to amateurs like me would be an incredible undertaking in and of itself. I’m just saying I wish there was some place where people could help each other and newcomers like me wrap their heads around the code rather than leaving each person to take it on all by themselves.

I feel like if there was such a place it would be much easier to eventually expand the dev team and help out anoek, but without it there’s kind of an accessibility bottleneck.

:person_shrugging:

Anyway, all of this is off-topic :laughing:

1 Like

It’s in the repo goratings. There is also the history of rated games to test the algorithm.

2 Likes

Oh, so that’s up to date? We were wondering earlier:

This does help me with my quest of understanding the mathematics, assuming I can understand any of it, so a big thank you :slight_smile: though I’m not sure how much it helps with the previous question of retrieving, say, the probability of one player winning predicted by the rating system without having to recalculate it from the ratings.

I guess it’s always fun to implicitly be called dumb :laughing:

But since you’re helping, say I use code to download the webpage for a game (like https://online-go.com/game/XXXXXXX), where in the webpage would I find that number? And never mind answering this specific question, how would I figure it out by myself?

Well that sounds like a problem? :laughing:

The rating system has an “opinion” about expected win rates, we can test that opinion against actual game results.
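
One hedged way to run that kind of check is a calibration table: group games by the predicted win probability and see whether the observed win rate in each group matches. The function name and the toy data below are made up for illustration; real input would be the system’s pre-game predictions and the actual results.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, n_bins=10):
    """Group games by predicted win probability and compare with results.

    predictions: predicted probability that Black wins each game (0..1)
    outcomes:    1 if Black actually won, 0 otherwise
    Returns {bin_index: (mean_prediction, observed_win_rate, game_count)}.
    """
    bins = defaultdict(list)
    for p, won in zip(predictions, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, won))

    table = {}
    for index, games in sorted(bins.items()):
        mean_p = sum(p for p, _ in games) / len(games)
        win_rate = sum(won for _, won in games) / len(games)
        table[index] = (mean_p, win_rate, len(games))
    return table


# Made-up toy data, only to show the shape of the output.
preds = [0.62, 0.65, 0.68, 0.31, 0.35, 0.38]
results = [1, 1, 0, 0, 1, 0]
for index, (mean_p, win_rate, count) in calibration_table(preds, results).items():
    print(index, round(mean_p, 2), round(win_rate, 2), count)
```

If the system is well calibrated, the mean prediction and the observed win rate should be close in every bin that holds enough games.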

2 Likes

Yes, I agree (for a while I was skeptical about it being significant, and I guess it can still be discussed, but intuitively it does feel like if the system can predict the winrate of individual games accurately, then it must be “good” in some ways at least). I was just saying that depending on how exactly one goes about testing it, they might end up falling into some trap where they’re not actually testing what they wanted. You sometimes hear about it happening to professional researchers and mathematicians, surely it can happen to a bunch of amateurs?

Personally, I still don’t really understand how one would measure the “quality” of the betting agent in the test you proposed. What is a statistically significant result in that case?

I find it particularly complicated because the probabilities change from match to match, so you essentially need to have a sampled population before you can even build a probabilistic model to have a good idea of what to expect.
(Personally, I’m so rusty in mathematics that I couldn’t even prove that the expected amount of points for a single match is greater than 0.5 :laughing: maybe there’s a way to use the Cauchy-Schwarz inequality?)

So maybe I was worrying too much, but it feels like with your model we’d basically have sampled the actual result (and be able to calculate the wanted quantity) before we were even able to calculate what we expected, and my circular-logic-spider-sense was tingling :laughing:

1 Like

In my understanding, anoek already did that for OGS.
I did it for EGF historical data when I was involved in the update of the EGF rating system in 2019-2021.

I don’t know what you want to investigate exactly and what for, but I think rating volatility is a somewhat different question than winrate prediction.

2 Likes

Well, I think the reasoning is that OGS uses a rating system that is heavily based on winrate prediction. As far as I understand, the Glicko-2 rating system (though as of this moment I’m not sure OGS uses exactly that, but it at least uses a version of that) is basically all about reverse-engineer-updating the ranks so that the winrate can be predicted as accurately as possible.

So the question we were investigating here (in my impression) was: is the rank volatility a bug or a feature?

In other words, is the rank volatility just a random fluctuation or does it actually happen because people’s actual skills fluctuate, and thus their winrate fluctuates with them, making the rank fluctuation a better prediction of their probability of winning a specific match?

And you know what, thank you for asking this question and making me think about this, because now I’ve realized we definitely can use this to create a test (which I actually had already described earlier, vaguely, but then got lost in the details of the convo and forgot about the big picture :laughing: ).

We can use the winrate prediction based on the current rating system, and we can then perform the same test but using an averaged out/smoothed version of the rating. If the volatile rating performs better than the smoothed out one, the volatility is a feature; otherwise, it’s a bug. :v:
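
A rough sketch of that comparison, under loud assumptions: `win_probability` is an Elo-style stand-in rather than the OGS formula, the smoothing is a plain trailing average, and the quality measure is mean log loss (lower is better). All names are hypothetical.

```python
import math


def win_probability(rating_a, rating_b):
    """Elo-style stand-in for the rating system's predicted win rate."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def moving_average(values, window):
    """Trailing average over the last `window` values (fewer at the start)."""
    smoothed = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def mean_log_loss(probabilities, outcomes):
    """Average negative log-likelihood; lower means better predictions."""
    total = 0.0
    for p, won in zip(probabilities, outcomes):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= math.log(p) if won else math.log(1.0 - p)
    return total / len(outcomes)


def compare(ratings_before_game, opponent_ratings, outcomes, window=5):
    """Return (loss using raw ratings, loss using smoothed ratings)."""
    smoothed = moving_average(ratings_before_game, window)
    raw_p = [win_probability(r, o) for r, o in zip(ratings_before_game, opponent_ratings)]
    smooth_p = [win_probability(r, o) for r, o in zip(smoothed, opponent_ratings)]
    return mean_log_loss(raw_p, outcomes), mean_log_loss(smooth_p, outcomes)
```

To stay clear of circular logic, each rating fed in has to be the one the player held before that game, so neither version gets to peek at the result it is predicting.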

Well, that’s my impression. Does everyone agree? :laughing:

(Edit: also, uh, I guess this forum topic should probably be forked from the original)

2 Likes

That’s an interesting question. I haven’t looked into the EGF data at such a fine-grained level, but it may be difficult to investigate this. As far as I remember, the EGF data is somewhat noisy, and even extracting a decent yet simple winrate prediction formula, without accounting for variables other than the current rating of both players, was fairly challenging.
You may need a lot more data than the ~1 million game results in the EGD to extract reliable statistics on such secondary variables.

Glicko-2 has other player variables that affect the rating updates (deviation and volatility), which may be intended to capture such player properties. But I’m no expert on Glicko-2, so I can’t say much about how to use those.

1 Like

Win rate prediction is based on rating, thus those two things surely correlate.

Mainly I just thought I could help come up with

Looking not only at the actual win rate of black/white, but the predicted one too, might be worth a try.


Imagine for a game the winning probability for black is 0.4. If you know that, you can bet 40:60 and your expected points are probabilityOfBlackWinning * yourPointsIfBlackWins + probabilityOfWhiteWinning * yourPointsIfWhiteWins = 0.4 * 0.4 + 0.6 * 0.6 = 0.16 + 0.36 = 0.52.
If you bet 30:70, then your expected points are 0.4 * 0.3 + 0.6 * 0.7 = 0.12 + 0.42 = 0.54 :hushed: and then I realize that needs some tweaking.

0.4 * (0.4 - 0.6) + 0.6 * (0.6 - 0.4) = -0.08 + 0.12 = 0.04 and 0.4 * (0.3 - 0.7) + 0.6 * (0.7 - 0.3) = -0.16 + 0.24 = 0.08 doesn’t make it any better. :thinking:
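
The reason neither version works seems to be that both payoffs are linear in the bet, so the expected points always grow as you shift more of the bet onto the favourite. A payoff based on the logarithm of the share placed on the winner does not have that problem. The small sketch below scans bets from 0.01 to 0.99 for the 0.4 example; the function names are hypothetical.

```python
import math


def expected_linear(p_black, bet_black):
    """Expected points when the payoff is simply the share bet on the winner."""
    return p_black * bet_black + (1 - p_black) * (1 - bet_black)


def expected_log(p_black, bet_black):
    """Expected points when the payoff is log(share bet on the winner)."""
    return p_black * math.log(bet_black) + (1 - p_black) * math.log(1 - bet_black)


p = 0.4  # true probability that Black wins, as in the example above
bets = [i / 100 for i in range(1, 100)]  # candidate bets on Black: 0.01 .. 0.99

best_linear = max(bets, key=lambda b: expected_linear(p, b))
best_log = max(bets, key=lambda b: expected_log(p, b))

print(best_linear)  # 0.01: the linear payoff pushes the bet to the extreme
print(best_log)     # 0.4: the log payoff peaks at the true probability
```

With the log payoff, betting your honest estimate is the best strategy, which is what a test of prediction quality needs.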

1 Like

I’m going to argue that the thing is that this tool is exactly what we’re testing: We’re trying to depict the most accurate reflection of a user’s strength during a handicap game, and with it the ratings system’s prediction of “who is expected to win the handicap game”. I will argue this system is as natural for handicap games as using the “Expected Score” function against the result when testing even games.

1 Like

My understanding is that in our analyses only the “general trend” winrates have been used, meaning errors are allowed to cancel each other out instead of adding up as they would with a binomial deviance function.

There are several random factors here: The first is the supposed randomness of games, which ratings systems are designed to make as close to 50% as possible as our metric of an “interesting” game without any more information. The second is the Ratings Deviation, which is essentially a measure of “how unsure are we about where your rating is right now”, and the last is the actual Glicko-2 variable volatility, which measures “how quickly is your true rating changing, thus increasing your Ratings Deviation”
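
For the third factor, the Glicko-2 description has a step where the deviation is inflated by the volatility before new results are applied, and for a player with no games in a rating period that step is the whole update. The sketch below shows only that step; what counts as a “rating period” on OGS is an assumption left open here, and `inflate_deviation` is a hypothetical name.

```python
import math

GLICKO2_SCALE = 173.7178  # conversion factor between the Glicko and Glicko-2 scales


def inflate_deviation(rd, volatility, periods=1):
    """Grow a rating deviation over `periods` rating periods without games.

    rd is on the familiar Glicko scale (e.g. 350 for a new player);
    volatility is the Glicko-2 sigma (e.g. 0.06).
    """
    phi = rd / GLICKO2_SCALE
    for _ in range(periods):
        phi = math.sqrt(phi ** 2 + volatility ** 2)  # phi* = sqrt(phi^2 + sigma^2)
    return phi * GLICKO2_SCALE


print(inflate_deviation(60.0, 0.06, periods=0))   # no idle periods: unchanged
print(inflate_deviation(60.0, 0.06, periods=10))  # grows with idle periods
print(inflate_deviation(60.0, 0.12, periods=10))  # grows faster with higher volatility
```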

1 Like

(I had actually written this part of the comment before @meili_yinhua’s reply, but it’s very topical :laughing:)

Personally, I believe a shortcoming of Glicko-2, especially as implemented on OGS, is that it doesn’t take the frequency of play into account. It’s been noted that players who play/conclude many games every day appear to be the ones with the most fragile rating (I guess this is another hypothesis we could test with sampling).

In general, Glicko-2 is usually implemented by looking at the last 15 games of a player. I’m not sure if this is still how it’s done on OGS, but it’s at least how it was done before 2021.

The problem I see with this is that it seems fundamentally different to look at the last 15 games of a player who plays, say, 4 to 20 games a week, and the last 15 games of a player who plays 100 games a week or even more. I’m pretty sure this includes deviation and volatility as defined in Glicko-2.

So for a while I’ve wanted to test whether averaging out the rating in a way that accounts both for time and quantity of games would make the rating more solid.
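
One possible shape for such an average, purely as a sketch: weight every past rating by its age with an exponential half-life, so the averaging window is measured in days instead of in games. The function name, the half-life, and the data layout are all assumptions.

```python
def time_weighted_rating(history, now, half_life_days=30.0):
    """Average past ratings, weighting each by how recent it is.

    history: list of (timestamp_in_days, rating_after_game) pairs
    now:     current time in the same unit
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp, rating in history:
        age = now - timestamp
        weight = 0.5 ** (age / half_life_days)  # weight halves every half-life
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total


# A rating from one half-life ago counts half as much as today's rating.
print(time_weighted_rating([(0, 1000.0), (30, 2000.0)], now=30))
```

A very active player still contributes more data points per day, so a real version might also cap or normalise the weight given to any single day.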

Now that I’ve read meili’s comment, I’m wondering if this can be applied more specifically to the Glicko-2 volatility stat in particular, i.e. is there a way to make volatility equally accurate for players that play at vastly different frequencies? (I feel like a simple way would be to extend the calculation window from 15 games to some way bigger amount, could OGS afford this? :laughing:)


After wracking my brain to remember how expected values work, I had figured this out. I also noticed that in our model it so happens that
probabilityOfBlackWinning = yourPointsIfBlackWins
and
probabilityOfWhiteWinning = yourPointsIfWhiteWins, which means the expression really simplifies to

E = probabilityOfBlackWinning^2 + (1 - probabilityOfBlackWinning)^2

And I was trying to prove analytically that E > 1/2, but I couldn’t do it.
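
For what it’s worth, completing the square seems to settle it. Writing p for the probability of Black winning:

```latex
\begin{aligned}
E &= p^2 + (1 - p)^2 \\
  &= 2p^2 - 2p + 1 \\
  &= 2\left(p - \tfrac{1}{2}\right)^2 + \tfrac{1}{2} \;\ge\; \tfrac{1}{2}
\end{aligned}
```

Equality holds only at p = 1/2, so E > 1/2 is true for every game except an exact coin flip, where E = 1/2.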

I have no idea what you were doing here :laughing:

I had the thought that, if we want to explore and develop this method more, the question we really need to focus on and investigate is “if my guess for probabilityOfBlackWinning is better, how exactly does that affect the result?”

This has changed to 1-game rating periods, and as I understand it (or so I hope) it uses special volatility calculations for games closer together or further apart, like Lichess does.

I’d imagine you’d have to write a whole new ratings system. Theoretically it should fit to each player for whatever rate they play at (as opposed to Glicko-1, which had it as a constant), but the effects of volatility on Ratings Deviation should already be time-dependent as-is (this is one of the things I griped about constantly during the “sliding window” system).

1 Like

Whoops, sorry, I edited the part of my comment you quoted after you had already quoted it :laughing:

By the way, since I have you here, can I ask your opinion on what I said here?