It seems the average rating displayed can be wildly different from a player's strength in that time control. Am I wrong?
It’s not skewed; you have different ratings for different board sizes and time controls. See below:
Overall* rating. It is not an average of your other ratings; it is its own unique pool, and you should not expect anything more than vague correlation between any two ratings. They come from distinct pools and thus cannot be compared with each other. You can only compare your rating in a pool with another person’s rating in that same pool.
This makes no sense at all. The example given has an overall rating higher than any component. There is no way that can happen with Glicko ratings unless there is an error, or, more stupidly, the overall pool does not use all the games from the other pools. I have been watching my own rating changes for all categories, and there is clearly a bug in the way a new result changes the rating. I get a good result and a rating goes down, but not to worry, my next poor result will make it go up again. It is abundantly clear that the storage and referencing of ratings is awry. As an ex-professional program debugger, I know I could find the bug if I got access to the code.
I saw somewhere the suggestion that your rating can go down because it is re-evaluated against all your past opponents’ rating changes, rather than what really should happen: your rating being based on the rating of the opponent WHEN you play them. Imagine how slow it would be to recreate the ratings of every person in every category after a single game is played. No, it does not happen that way, and few programmers would even relish attempting such a stunt. Imagine the possibilities: two young players play games against each other; one stops playing for a while, the other becomes a 1d. Both will now show as 1d. Nah!
This is normal. While I see this is your first time on the forums (welcome!!) I’d recommend doing some research on how the ratings are calculated and what they mean before insisting so vehemently that you’ve discovered a bug. The overall rating is calculated from all games you’ve played - it is not an amalgamated rating from the individual time/size ratings. Those are displayed purely for curiosity’s sake, and you cannot compare any sub-ratings with your overall rating. In the end, they have very little meaning beyond comparing your skill in a given sub-rating to someone else’s, assuming you’ve both played at least a few games in that rating class.
As for a rating going down after a win, this is possible, and a known side effect of this particular implementation of the Glicko-2 rating algorithm (where the rating period is a sliding 15-game window). This can create the seemingly baffling effect where you lose rating points after a win, depending on how that win affects the pre-window rating that the final rating calculations are based on. I’d recommend reading through Professor Glickman’s paper to see how ϕ’ for a new rating is based on ϕ* (the previous rating deviation before the window), and note that unlike the specific example in the paper, this is a sliding per-player window on OGS, to account for the way games are played sporadically as opposed to, say, in an organized professional league. Unfortunately, accuracy and intuitiveness are not intrinsically linked in mathematics, despite what some may expect.
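To make the sliding-window effect concrete, here is a minimal sketch of a Glicko-2 update in Python. It follows the standard steps from Glickman’s paper, except that the volatility σ is held fixed rather than re-solved iteratively (a simplification), and the window mechanics are my own illustration of the idea described above, not OGS’s actual code. The final lines show how adding a *win* against a weak opponent can lower the output, because it pushes an earlier strong win out of the 15-game window:

```python
import math

SCALE = 173.7178  # Glicko-2 scale constant from Glickman's paper

def g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(mu, mu_j, phi_j):
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def glicko2_update(rating, rd, games, sigma=0.06):
    """One rating-period update over a batch of games.

    Simplification: volatility sigma is held fixed instead of running the
    iterative step 5 of the paper.
    games: list of (opponent_rating, opponent_rd, score), score in {0, 0.5, 1}.
    """
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    v_inv, delta_sum = 0.0, 0.0
    for opp_rating, opp_rd, score in games:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        e = expected_score(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1.0 - e)
        delta_sum += g(phi_j) * (score - e)
    phi_star = math.sqrt(phi ** 2 + sigma ** 2)             # pre-period RD inflation
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + v_inv)  # 1/v == v_inv
    mu_new = mu + phi_new ** 2 * delta_sum
    return 1500.0 + SCALE * mu_new, SCALE * phi_new

def sliding_window_rating(pre_rating, pre_rd, history, window=15):
    """Recompute from the pre-window state using only the last `window` games."""
    return glicko2_update(pre_rating, pre_rd, history[-window:])

# A win can lower the displayed rating: the 16th game (a win against a much
# weaker player) pushes the earlier big win against a 2400 out of the window.
history = [(2400, 200, 1)] + [(1500, 200, 0)] * 14
before, _ = sliding_window_rating(1500, 200, history)
after, _ = sliding_window_rating(1500, 200, history + [(1300, 200, 1)])
assert after < before
```

The inequality at the end is exactly the counterintuitive behavior being discussed: the win itself contributes positively, but the recomputation over a shifted window can still come out lower.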
Now, anoek has noted previously that while this seems, based on testing, to be a more accurate way to determine ratings, it makes for a suboptimal user experience. As such, he’s expressed that this particular aspect of the rating system is something he’s looking at tweaking slightly. See thread here:
Also, if you’d like to look at the implementation, have at it! In the previously linked thread, anoek is clear that he’s very open to suggestions that can demonstrably improve accuracy, but please, before you get too far into the weeds, do familiarize yourself with the details of the current implementation, lest you fall down the wrong rabbit hole. That repository solely houses the calculation code and analysis for single-day windows and one-game-at-a-time windows, as opposed to the sliding 15-game window used in production for newly played games. I’m also not sure if there have been changes on OGS that haven’t made it into that repo, so, grain of salt and whatnot.
Many thanks for the long and patient reply. You are of course correct, both in your explanation of the rating system and about my overly vehement critique. Nothing I am seeing is inconsistent with the employed system. My apologies.
I have noticed that two of my games, won by timeout toward the end of what were clearly won positions, caused no rating movement whatsoever in any category. Am I imagining something?
As you are only playing correspondence, some timeouts may be annulled.
If a player times out of all their games, only the first game counts towards ranking, the rest are not counted towards rank, though there is no indication as such. If your last correspondence game ended with you losing by timeout, all subsequent correspondence games you time out of will not affect your rank, until you end a game by scoring or resignation.
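A rough sketch of that rule, assuming outcomes arrive in chronological order (the function name and outcome labels are invented for illustration; the real server logic is more involved):

```python
def rated_timeouts(results):
    """Decide which games count toward rating under the rule above: after a
    timeout loss, further timeout losses are annulled until the player
    finishes a game by scoring or resignation.

    results: list of strings, each 'timeout', 'scored', or 'resigned'
             (hypothetical labels for this sketch).
    Returns a parallel list of booleans: True if the game is rated.
    """
    rated, in_timeout_streak = [], False
    for outcome in results:
        if outcome == 'timeout':
            rated.append(not in_timeout_streak)  # only the first timeout counts
            in_timeout_streak = True
        else:
            rated.append(True)
            in_timeout_streak = False
    return rated
```

So for a player whose games end `timeout, timeout, scored, timeout`, only the second timeout would be annulled; the streak resets once a game is finished normally.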
What happened was that your opponents had timed out of several games, so your game didn’t count towards ranking. This can be unfairly exploited, so that you don’t lose rank if you time out of a clearly lost game, but if that happens, you can call a moderator. Be sure that they were doing it with deliberate malicious intent before you report them, though, as falsely reporting someone is not good.
The ranking system is also kind of broken, but this is the best we can do.
About the overall ranking being higher than all the others, take the following extreme case.
The last 15 games were all wins against 9 dans: 5 on 9x9, 5 on 13x13, 5 on 19x19. The 30 games before that were all losses against 25 kyus: 10 on 9x9, 10 on 13x13, 10 on 19x19. The rating system takes a 15-game window for each pool.
The 9x9, 13x13, and 19x19 calculations would each take into account that you beat five 9 dans but lost to ten 25ks, giving you an average rank of around 20k. This rank happens for complicated reasons, but note that the amount of rating a 20k gains for beating a 9d is essentially the same as for beating a 5k, since both results are already near-certain upsets.
Your 9x9, 13x13, and 19x19 ranks are each ~1200. This is low.
Your overall rating only sees the last 15 games, which were all wins against 9 dans. As such, it makes your rating ~3100.
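The key point can be made concrete with a small sketch (board sizes and outcomes as in the example above; this only shows which games each pool's 15-game window contains, not the actual rating math):

```python
# One game per tuple: (board_size, outcome), in chronological order.
# First 30 games: losses vs 25k on each size; last 15: wins vs 9d on each size.
history = [(size, "loss vs 25k") for size in (9,) * 10 + (13,) * 10 + (19,) * 10]
history += [(size, "win vs 9d") for size in (9,) * 5 + (13,) * 5 + (19,) * 5]

def last_window(games, size=None, n=15):
    """The n-game window a given rating pool sees (size=None -> overall pool)."""
    pool = [game for game in games if size is None or game[0] == size]
    return pool[-n:]

overall = last_window(history)          # all 15 recent wins vs 9 dans
nine_by_nine = last_window(history, 9)  # 10 losses vs 25k + 5 wins vs 9d
```

The overall window contains only the 15 wins, while each per-size window still contains ten of the old losses, which is why the overall rating can sit far above every component rating.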
I hope this answers a few questions.
Many thanks for your response. I had guessed as much. On timeouts, I had assumed, as in chess, that time is of the essence, having experienced as much over 50 years of correspondence chess. I can, however, understand the desire not to skew the ratings with mass timeouts for non-playing reasons.
It strikes me as relatively simple to decide whether to rate a timed-out game via a quick AI evaluation (the number of moves played would be simpler, albeit cruder). If the AI evaluated a timed-out game as giving the winner better than a certain win probability, then rate it. This would require an additional or alternative test when deciding whether to rank a game. My gut feeling is that this would not require an inordinate amount of work, but of course I could be wrong if the AI interface made it difficult.
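As a rough illustration of that suggestion, such a decision function might look like this. Everything here is hypothetical: the win-probability input, the threshold values, and the move-count fallback are all invented for the sketch; nothing like this exists on OGS:

```python
def should_rate_timeout(winner_win_prob, moves_played,
                        prob_threshold=0.85, min_moves=30):
    """Hypothetical check for whether a timed-out game should still be rated.

    winner_win_prob: AI-estimated win probability for the player who won on
    time (assumed to come from some engine evaluation), or None if no
    evaluation is available.
    Falls back to the cruder move-count test when no evaluation exists.
    """
    if winner_win_prob is not None:
        return winner_win_prob >= prob_threshold
    return moves_played >= min_moves
```

So a timeout in a clearly won position (say, a 95% AI evaluation) would be rated, while a timeout on move 10 of an unevaluated game would not.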
This was debated and there was general antipathy towards AI deciding games when humans, especially lower ranked humans, will have much larger swings.
I wouldn’t be opposed to just ranking all timeouts as losses on time with all that that entails; the rank can be regained, but that’s just my view, and I don’t think (though I could be wrong) that I’m in the majority on this particular point.
I think big issues arise, though, if a player times out of a lot of correspondence games: that could inflate the ranks of possibly hundreds of players. This would particularly be the case if they happened to be a high-rated player (like a few of the high 5-9d around who seem to play correspondence), or in general if a player happened to be playing ranked games against much weaker opponents.
So while that player might be able to regain their rank, the effect is spread out across possibly hundreds of other players of different ratings.
A one-off can probably correct itself, but if these mass timeouts happen every so often, the rating system probably just starts to accumulate a lot of noise?
I imagine something like this is probably why mass timeouts don’t just count as losses.