Is rating volatility a bug or a feature? [Forked]

Holy, you’re right :neutral_face: I hadn’t realized the confidence interval was one-sided, I thought it was just a graphical glitch! (Most likely because as you say there’s a “true rank” at 6 kyu at the center of the interval)

(I’ll read your other message when I have more time; in the meantime, thanks for your contribution and for stating outright that you hadn’t read everything, I appreciate it :slight_smile:)

1 Like

This is a great explanation and the first time I’ve actually understood this “humble rank” concept. (But I think there’s no factor of two.) I also dug up what looks like the original proposal explaining what it’s supposed to be:

So I guess all the matchmaking parts are broken now leaving just the display, which is only in one or two places anyway.

1 Like

Yeah, humble rank is an arcane bit of knowledge that I don’t pay much attention to but have to explain every now and then. It’s a weird middle-of-the-road solution born of some very strongly held opinions that Glicko is “supposed” to give new players a certain middle-of-the-road insertion position (when really that’s just the result of assuming you have no information for a better-informed initial ratings distribution).

4 Likes

IIRC the question of humble rank came up in the 2021 discussions, and it was said that humble rank was broken. I believe that GreenAsJade/Eugene, who introduced humble rank to OGS, agreed that it was broken. What, if anything, has happened to it since that time, I don’t know.

1 Like

I was just now trying to trace down my “factor of two” confusion, and now I’m confused again about humble rank. I went to the page of a player who has played no rated games, and their rank in the display area is “1150 ± 350” / “11.9k ± 4.9”. However, in the list of games, it displays as “6k”.

I had thought (and said, above) that 6k was the center estimate. Is it actually 12k? If so, what is the 6k in the games list? It doesn’t seem to be 11.9 − 4.9 = 7k plus some rounding error, either. For one, that would mean the games list is displaying some kind of “boastful rank” rather than humble rank, which seems weird. Also, my rank is “9.0k ± 1.1”, which would place my “boastful rank” at something like 7.9k, but it displays in the same game lists as 9.0k [correction: 9k].

So it does seem like there is some weird display bug somewhere. My explanation for @espoojaram’s recent example only makes sense if the 11.9k was the humble rank but, if so, why does it display as “11.9k ± 4.9” on the profile page? That clearly seems to indicate 11.9k as the central estimate, not the humble rank. But even then, if it is displaying “[rating - deviation] ± [deviation]” on the profile and “[rating]” on the games list, then why does my rank display as “9.0k ± 1.1” on my profile and “9k” in the games list? I’m pretty sure it wasn’t until my actual rank was “9.0” that it displayed that way, either, having rounded up to “10k” even when I was 9.2 on the profile.

I’m now much more confused than I was yesterday.

[Regarding the “factor of two” confusion, that came from the Glicko-2 example PDF, which suggests displaying an approximate 95% confidence interval as ± 2* RD. I had assumed that was what OGS implemented.]

[Retracted; the relevant discussion is in another thread.] Question: without getting into display bugs, does anyone here know, outright, whether the initial rating (the actual rating, for the Glicko algorithm) we use here is 1150 or 1500?

2 Likes

We just had a discussion about this, starting here: Are OGS rankings inflated, deflated or neither? - #13 by Conrad_Melville

1 Like

This is the problem with all this stuff being split between two different threads.

1 Like

To answer your question, I no longer know. It looks like we are back to 11.9k as the starting pseudo-rank, but it is unclear whether the 6k on the thumbnails is a “leftover” (and therefore a bug) or whether it is used for matching.

FYI, all ranks are rounded up for anything over X.0

1 Like

Hmmm. My conjecture/interpretation, when you pointed that out, was that the rating plotted on the profile page is a weighted average, or interpolation, of the “low boundary of the 1σ confidence interval” (the “humble rank”) and the current Glicko-2 rating estimate, moving from 100% humble rank at the beginning to 100% Glicko-2 as a function of the deviation itself.

Since ratings are considered “accurate” when the deviation reaches 160, I would guess the interpolation parameter slides over the interval [350, 160].

Something like this:
[image: interpolation formula]
(sigh, I wish this platform allowed for math equations)

(EDIT: I realized the formula simplifies to this :sweat_smile:)
[image: simplified formula]

In the above formula I’m assuming that the Deviation starts out at 350 like it’s displayed in the graphs.
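As a concrete sketch, my conjecture would look something like the code below (the function name and the 350/160 endpoints are just my assumptions from above, not confirmed OGS behavior):

```python
def displayed_rank_conjecture(rating, deviation):
    """Conjectured display rating: interpolate between the 'humble rank'
    (rating - deviation) and the raw Glicko-2 rating, with the weight
    sliding from 350 (new player) down to 160 ('accurate').
    This is a guess at OGS behavior, not confirmed code."""
    # Interpolation weight: 1 at deviation=350, 0 at deviation=160
    alpha = (deviation - 160) / (350 - 160)
    alpha = max(0.0, min(1.0, alpha))  # clamp to [0, 1]
    humble = rating - deviation
    # alpha*humble + (1 - alpha)*rating simplifies to rating - alpha*deviation
    return rating - alpha * deviation

# A brand-new player (deviation 350) would display their full humble rank:
print(displayed_rank_conjecture(1500, 350))  # 1150.0
# Once the deviation reaches 160, the display would equal the raw rating:
print(displayed_rank_conjecture(1500, 160))  # 1500.0
```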

But even if that formula is right, it doesn’t answer why the deviation is displayed as 4.9. (It being expressed as “11.9 ± 4.9” doesn’t surprise me; a lot of scientific writing does the same thing, noting down a simplified symmetric uncertainty even when the real uncertainty isn’t symmetric, and the quantity we’re talking about is essentially made up anyway.)

I’d say that the true provisional rating being 1150 wouldn’t make sense given the example I brought up above (losing three games and gaining rating each time).

Also, note that 1150=1500-350, whereas 11.9 != 6 +4.9.

My guess is that the deviation is being converted into a “rank-like” quantity through the usual conversion formula, which is log-based, so it ends up having this weird effect of shrinking the deviation.
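To illustrate the shrinking: the rating-to-rank conversion usually cited for OGS is rank = ln(rating / 525) · 23.15, with displayed kyu = 30 − rank. Treating that formula as an assumption (I haven’t verified it against production code), pushing a symmetric rating interval through it reproduces the numbers above surprisingly well:

```python
import math

def rating_to_kyu(rating):
    """Log-based rating-to-rank conversion usually cited for OGS
    (an assumption here, not verified against production code):
    rank = ln(rating / 525) * 23.15, displayed kyu = 30 - rank."""
    return 30 - math.log(rating / 525) * 23.15

center = rating_to_kyu(1500)        # ~5.7k  (the "6k" on the thumbnails?)
humble = rating_to_kyu(1500 - 350)  # ~11.8k (the "11.9k" on the profile?)
boast  = rating_to_kyu(1500 + 350)  # ~0.8k

# The symmetric +/-350 rating interval becomes asymmetric in kyu:
# ~6.2 ranks downward but only ~4.9 ranks upward, which would explain
# where a "± 4.9" could come from.
print(round(center, 1), round(humble, 1), round(boast, 1))
```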


Well, this is a very technical thread, and also the main topic of this thread is actually to devise a test to verify once and for all if the current rating volatility is hurting the handicap system and the matchmaking.

2 Likes

I plan to start looking into the ratings issues in more detail. To bump this earlier question, because I don’t see it answered, do we know if the python code found in the goratings GitHub is the actual production code? (And therefore definitely reflecting what OGS actually does.) Or is that exclusively algorithm development code?

The README on GitHub leaves it ambiguous, in my reading.

1 Like

While the question remains unanswered outright, this reply above from a co-developer seems to imply that it is.


Oh, I agree :laughing:

1 Like

@paisley:
I’ve finally gotten around to reading your first reply to this thread :laughing:

Given how long it looked, it was a surprisingly painless experience: even for a quasi-layman like me, it was pretty understandable.

Well, I don’t know what log-likelihood is, but I’ll take you at your word that it might be a good measure for evaluating a rating system.

Now, I think the best way to perform the test is to implement all of the decent proposed methodologies and get many different kinds of evaluation. If they are good measures and there’s a clear-cut answer to be found, they should mostly agree, and if they don’t agree, that will tell us something weird is going on. And anyway, the decision as to whether any of the systems are good will be left to the readers/peer-reviewers.

So what I mean is: for what it’s worth, if it comes down to me I’ll do my best to understand the method you proposed thoroughly and implement it. (Hopefully people more proficient than me can collaborate though :laughing:)

I believe the point you bring up about the volatility (that it should be impossible to update with 1-game rating periods) was vaguely referenced in this reply and then again in this reply, not much, but it might help you. It would also have been impossible to get specific without seeing the code, which you seem to be working on (assuming that goratings repository is the code used right now).

Before reading, I was actually planning on giving you some advice on how to format and layout a long message to make it more readable, but to be honest the vibe I’m getting in the last few days is that I have alienated most of this forum’s community with my long replies, so I guess my advice is: don’t write long replies? :laughing:

You haven’t alienated me. I enjoy reading your posts, but I didn’t spend much time on my computer or phone during the past week with Christmas dining and some outings (enjoying a week off).

3 Likes

True, I guess a lot of people would rather spend most of their time these days with their families than on this forum. Well, whether or not my vibe-radar is correct, I still feel I have been a little “invasive”; I should probably try to respect other people’s time more :slight_smile:

1 Like

We’ve referenced it under different names: “binomial deviance”, “Shannon entropy”, etc.

I like to describe it in information-theoretic terms as the “average amount of information each data point provides in the model.” Information here is considered a bad thing because it’s a measure of “surprisal”: if each data point provides less information on average, the model is doing a better job of predicting the outcomes.

Now, to explain the logarithms, I’ll give this example:
It would be reasonable to say that the information provided by a flip of a fair coin is 1 bit, the information provided by two such flips is 2 bits, three is 3 bits, et cetera…

Now, if we think about the probabilities behind these flips of increasing information, the information increases additively while the probabilities decrease multiplicatively. That is, the information provided by n flips is exactly 1 bit more than that of n−1 flips, while the probability of a particular result of n flips (assuming the order of flips is preserved) is exactly half that of n−1 flips.

Luckily there’s a neat little class of functions that treats multiplicative relationships as if they were additive: logarithms. So naturally we can take the probabilities p of the results of these multi-coin flips and run them through −log2(p) to get a function that behaves the way we described above.

Now, of course, so far this only seems to be useful in flipping fair coins, but to expand this idea slightly, let’s imagine the case of two fair coin-flips where order is not preserved. So 1/4 of the time it’s two heads, another 1/4 of the time it’s two tails, and 1/2 of the time it’s one head and one tail.
Naturally it makes sense that in the two-heads and two-tails cases they still provide 2 bits of information each when they occur, but when there’s one head and one tail, there are two different possible underlying outcomes. One might argue it’s a “coin flip” which of the 2-bit outcomes actually happened, so we interpret the one-head-one-tail result as recovering only 1 bit of information, since it takes one more bit to recover the full effect of the two coin flips.
One would also notice that its having probability 1/2 and carrying one bit of information is no coincidence, and the theory is that you can extend similar logic to any source of information via the formula −log2(p). This becomes much more concrete in the realm of encryption and data compression, as it correlates with file sizes after compression, or with the random sources of data loss you have to consider during encryption.
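The coin-flip arithmetic above fits in a couple of lines of code (just restating the example, nothing new):

```python
import math

def surprisal_bits(p):
    """Information (in bits) carried by observing an outcome of probability p."""
    return -math.log2(p)

# Ordered pair of fair flips: each of the 4 outcomes has p = 1/4 -> 2 bits.
print(surprisal_bits(1 / 4))  # 2.0
# Unordered "one head, one tail": p = 1/2 -> only 1 bit, since one more
# bit (which coin was heads) would recover the full 2-bit result.
print(surprisal_bits(1 / 2))  # 1.0
```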

Now, you probably saw it stated in the form of -[Y*LOG10(E) + (1-Y)*LOG10(1-E)] somewhere.

A difference you might notice is the use of a base ten logarithm, but this is merely a scaling factor. If that doesn’t make sense I would highly recommend playing around with various exponentials and logarithms, but the rule is loga(x) = logb(x)/logb(a).

But you might also notice those Y’s. They’re supposed to be 1 for a win, 0 for a loss, and usually 1/2 for a tie (in games where those occur). So whenever you win, the second term becomes 0 and after simplifying you get −log10(E); when you lose, the first term becomes 0 and the result becomes −log10(1−E), where E is the model’s predicted probability of you winning. If you average this out over every game, you obtain the average amount of information that the results of games provided over and above the model’s predictions.

There are also a couple of added benefits, such as punishing the model infinitely for having 100% certainty and being wrong, but as I hear it this is a fairly standard test of a model’s accuracy.
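Putting the whole post into one short sketch (same formula as above, base-10 logs; the function name is mine):

```python
import math

def binomial_deviance(predictions, outcomes):
    """Average log-loss over a set of games.  predictions[i] is the model's
    probability E that the player wins game i; outcomes[i] is Y: 1 for a
    win, 0 for a loss (and 1/2 for a tie, in games where those occur)."""
    total = 0.0
    for e, y in zip(predictions, outcomes):
        total += -(y * math.log10(e) + (1 - y) * math.log10(1 - e))
    return total / len(predictions)

# A model that always says 50/50 scores -log10(0.5) ~ 0.301 per game...
print(round(binomial_deviance([0.5, 0.5], [1, 0]), 3))  # 0.301
# ...while a sharper, mostly-right model scores lower (less surprisal):
print(round(binomial_deviance([0.9, 0.2], [1, 0]), 3))  # 0.071
```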

3 Likes

This is ratings system discussion, from what I can tell most people in it either enjoy the debate or have fuming opinions on it…
I’m just paying for my sins of learning too much about the system because of prior arguments about it :stuck_out_tongue:

2 Likes

Interesting, so binomial deviance is essentially an improved version of the very first evaluation method @Jon_Ko proposed in this thread. And you’re saying it’s also the same as @paisley’s proposal, which does check out as far as I can tell from the description of his model.

Since I was also planning on learning about binomial deviance, turns out there was only one model after all :laughing:

Well, there’s still my proposed idea of just naively creating a histogram of expected win% and, for every data point (a set of games with similar expected win%), counting the percentage of actual wins. The discrepancy between these two functions should somewhat measure the accuracy of the win predictions (although I expect there are possibilities of mutual cancellation and other weird statistical effects).
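For what it’s worth, that histogram idea is basically a calibration check, and a naive sketch of it is short (synthetic data here, just to show the shape of the computation):

```python
def calibration_table(predictions, outcomes, n_bins=10):
    """Bucket games by predicted win probability and compare each bucket's
    average prediction to its actual win rate.  A well-calibrated model
    has the two numbers close in every bucket."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predictions, outcomes):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((p, y))
    table = []
    for i, games in enumerate(bins):
        if games:
            avg_pred = sum(p for p, _ in games) / len(games)
            win_rate = sum(y for _, y in games) / len(games)
            table.append((i, round(avg_pred, 2), round(win_rate, 2), len(games)))
    return table

# Tiny synthetic example: predictions of 0.7 that come true 75% of the time
preds = [0.7, 0.7, 0.7, 0.7]
wins  = [1, 1, 1, 0]
print(calibration_table(preds, wins))  # [(7, 0.7, 0.75, 4)]
```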

Eh, I guess trying it won’t hurt, but I’m now convinced of the validity of binomial deviance as an evaluating function.

Excellent explanation by the way. I felt a little “babysat” while reading it, but in my ignorance I probably wouldn’t have understood you otherwise :laughing:

Edit: by the way, before I go around spreading misinformation, can anyone confirm for me that the Deviation in Glicko represents the (estimated) standard deviation of a Gaussian distribution, thus marking a confidence interval of about 68%?
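Whatever the answer turns out to be, the ≈68% figure itself is just the one-standard-deviation coverage of a Gaussian, which is easy to sanity-check:

```python
import math

# For a Gaussian, P(|X - mu| < sigma) = erf(1 / sqrt(2)) ~ 0.6827, so a
# "rating +/- deviation" band is roughly a 68% interval IF the deviation
# really is one standard deviation (which is exactly the question above).
coverage = math.erf(1 / math.sqrt(2))
print(round(coverage, 4))  # 0.6827
```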

1 Like

My apologies; if I don’t put myself in a pedagogical mindset, I often don’t give enough detail to understand half of what I’m saying.

That is the assumption that the Glicko systems, as described, make relative to Glickman’s underlying model, yes.

(To be more specific, the Glicko and Glicko-2 systems assume a Normal prior distribution, and that a Bayesian update of a normally distributed rating estimate based on the result of a game will continue to approximate a Normal distribution, alongside the underlying model’s assumption that the movement of “true ratings” is Brownian motion, which yields a normal distribution over time.)
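The Brownian-motion assumption is also what drives deviations back up when a player is idle: in the original Glicko paper, variance grows linearly with elapsed time, so before each rating period RD is inflated as RD′ = min(√(RD² + c²·t), 350). A sketch (the c value below is the paper’s illustrative one, not anything OGS-specific):

```python
import math

def inflate_rd(rd, c, t, cap=350.0):
    """Pre-period deviation update from the original Glicko paper: under
    Brownian motion of true skill, variance grows linearly with elapsed
    time t, so RD' = sqrt(RD^2 + c^2 * t), capped at the new-player value."""
    return min(math.sqrt(rd * rd + c * c * t), cap)

# With the paper's illustrative c ~ 34.6, a confident deviation of 50
# drifts back toward 350 over roughly 100 idle rating periods:
print(round(inflate_rd(50, 34.6, 1), 1))    # 60.8
print(round(inflate_rd(50, 34.6, 100), 1))  # 349.6 (nearly a new player)
```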

2 Likes

Yeah, this looks close to right; you can see the code here

Then I guess you can look at the other repository for the ratings code itself.

2 Likes

Well, I’ll take this opportunity to say that I usually don’t like Gaussian distributions, and that I pettily refuse to call them “Normal”, as I believe they’re far less common in nature than they’re thought to be. I think a lot of the time people just lazily say a distribution is Gaussian when it’s any bell-shaped distribution, because the Gaussian has been studied thoroughly and is mathematically very convenient.

But of course as long as they produce a model effective for a specific practical purpose, such as the Elo-spawned family of rating systems, I don’t care if it’s “wrong in theory”.

Anyway, digression:

I was initially confused by this, but I guess it makes sense.

Essentially, if I understand correctly, there’s a hypothesized distribution of the “true ratings” of the entire population of players at any point in time, and we assume it to be Gaussian. But then each individual player has a distribution of “effective skill” that they might randomly showcase at any point in time, and (I guess because of the Brownian thing) we assume that to be Gaussian too.

So in practice, at the start, when you don’t know anything about the player, you can only sample them from the bigger distribution, so the initial deviation is really the deviation of the distribution of players. But as you get more info on the individual player, you start developing a model of their own distribution of strength, and the modeled distribution somehow “interpolates” from the distribution of players to a (hopefully) good model of the individual player’s strength.

So the “Deviation” is, like, a composite of the uncertainty about where the player’s “central rating” is and the estimated Gaussian spread resulting from the Brownian motion. I say “composite”; I imagine it’s the square root of a sum of variances.
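If that guess is right, the combination would just be the usual rule for independent Gaussian uncertainties (root-sum-of-squares; the numbers here are purely illustrative, not Glicko internals):

```python
import math

# Guessed decomposition from above: total deviation as the root-sum-of-
# squares of (a) uncertainty about the player's central rating and
# (b) the spread of their day-to-day performance.  Illustrative numbers.
uncertainty_of_mean = 200.0
performance_spread = 150.0
total_deviation = math.sqrt(uncertainty_of_mean**2 + performance_spread**2)
print(total_deviation)  # 250.0
```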

(Some day I’ll have to get to actually studying the glicko papers, instead of pestering you :laughing:)

1 Like