2020 Rating and rank tweaks and analysis

From a statistical perspective, however, if we assume that all the previous data has already been used to produce a distribution that properly reflects the expected distribution of your play, then any additional data point should simply update that distribution. The previous data should already be accounted for.

Hence a proper rating system should never take a new piece of data and update in the opposite direction. I understood the previous system: since it re-evaluated the data of previous games, it makes sense that the information gained there could easily overcome the update from the most recent game.

But suppose I flip a coin (which we do not know is fair), and we have an estimate of how often it comes up heads or tails. If it flips heads, sure, that's not always a strong indicator that heads is more likely than your previous estimate suggested. But you should not be able to take the model and say "well, the last few flips that we've already accounted for were tails, so let's update it to be more likely to flip tails," because that information should already be accounted for in the model.
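The coin-flip argument can be sketched with a standard Beta-Bernoulli model (my own toy illustration, not code from any rating system; `posterior_mean` is a made-up name): the posterior already encodes all past flips, so observing one more heads can only move the estimate of P(heads) upward, never downward.

```python
def posterior_mean(heads, tails, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(heads) under a Beta(prior_a, prior_b) prior,
    after observing `heads` heads and `tails` tails."""
    return (prior_a + heads) / (prior_a + heads + prior_b + tails)

heads, tails = 3, 7                          # past flips, already folded into the estimate
before = posterior_mean(heads, tails)        # 4/12 ~ 0.333
after = posterior_mean(heads + 1, tails)     # observe one more heads: 5/12 ~ 0.417

# a heads observation never pushes the estimate toward tails
assert after > before
```

The point is that the past tails are already in `heads, tails`; the new observation updates from there, it does not get to re-weight history in the opposite direction.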

Not to mention, the way this system currently works, it's not just smoothing things out. If you have an upset that would only happen 1 time in 100, which is already going to make a notable change, then under the current system it's as if that upset happened 15 times in recent history, creating a much bigger set of outliers in a system where recent results matter more than old ones by design.

3 Likes

One of the problems with the sliding window is that we get 15 loosely correlated ratings. The new game doesn't change the rating estimate relative to the last available rating, but relative to the rating from 15 games before.

I don’t understand what you’re trying to convey.

2 Likes

What I mean is that under the sliding window, each game will, eventually and inevitably, affect your rating as if it happened 15 separate times, according to the model that underpins the Glicko-2 system.

How is this possible? Well, when a game is first played, it counts once and contributes its effect to the rating period; that makes sense if it ended there. But then you play another game, and the first game is still in that rating period, contributing its effect a second time. Another game makes a third time, and by the time it leaves the window, it will have contributed its effect 15 times over what it would under the Glicko-2 system as described by Glickman.

1 Like

Are you sure? I'm not smart enough to tell whether that's right or not. Because we're evaluating something that also changes over time, anything can happen there.

I imagine the system doesn't just bring your rating close to your "performance" rating over the last 15 games. It has an implicit or explicit parameter for how fast ratings are allowed to change. So if you suddenly play worse, it's going to slowly bring you down to your recent performance rating. And if your recent performance is abysmal, one win might not be enough to offset it. Thought of this way, bad performance has a delayed effect.


At the end of the day, a rating system generally doesn't have to make sense to us (except that we like to understand things). As long as it predicts the results well, it could be a random number generator. Hopefully, open-source rating systems will develop enough to be usable even by ordinary people so we can try things. I'd love to plug a neural network in there.

2 Likes

that actually makes it worse, because if you don't know how it's changing over time (the theory essentially models true rating as Brownian motion), then new results should be even bigger indicators of a change in position (which is exactly what both Elo and Glicko are designed to respond to)

similarly, good performance has a delayed effect, and from a theory perspective, where you're trying to make the best matchups based on rating, it makes no sense not to just update to the most likely rating.

I'd argue that's not quite true. Glicko and Elo are successful systems because they have sound theory underpinning their mechanisms (although Elo makes a couple of simplifying assumptions that made more sense before the era of computers). While you could in theory plug in an NN trained on predicting this kind of thing, you lose control over knowing what kinds of errors you get, and how large, as well as opening yourself up to very wild behavior.

2 Likes

It would be interesting to see whether something from KataGo, say a mean of game scores (percentage or actual points), would actually be an indicator of, or correlate with, rating. I'm not sure which would be better. I feel like a mean would hopefully account for games of different lengths, although games where big groups die could throw things off, but that can happen at any level.

1 Like

I’ll omit my disagreements with your post.

I offer an analogy. When we use gradient descent to optimize a function, the function's surface can be bumpy and ugly, and the gradient at each point doesn't directly point in the right direction. So we add momentum to it: we average the gradient over multiple points and go in that summary direction (roughly speaking). Even though at some points we might move in a direction that doesn't match the gradient at that specific point, we reach the optimum faster (I mean, hopefully lol).
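The momentum idea can be sketched numerically (my own toy illustration, with made-up names like `momentum_step`; this is not rating-system code): after a run of consistent gradients, a single contrary gradient does not flip the direction of the step.

```python
def momentum_step(velocity, gradient, beta=0.9, lr=0.1):
    """One momentum-SGD update: velocity is a running average of past
    gradients, and the step follows velocity, not the raw gradient."""
    velocity = beta * velocity + (1 - beta) * gradient
    return velocity, -lr * velocity

v = 0.0
for _ in range(3):
    v, step = momentum_step(v, gradient=+1.0)   # consistent history of +1 gradients

v, step = momentum_step(v, gradient=-1.0)       # one contrary gradient
# plain gradient descent would step +lr here; with momentum, history
# outweighs the latest point and the step stays negative
assert step < 0
```

This is exactly the behavior being debated above: the update direction need not match the most recent data point.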

So I’m not sure the change (direction we go in) has to match the last result (gradient at that specific point) for better overall results.

At least that’s how it potentially might work.

2 Likes

Maybe everything would be better with at least, say, 100 or even more games instead of just 15?
What was the reason for the 15?
It seemed very small to me right from the beginning.
So perhaps the logic is right, but this parameter is inappropriately low.

1 Like

Not everyone plays 100 games a day

2 Likes

In Glicko-2, each game contributes only if the base rating is older than the game; therefore it contributes only once to the rating change.

2 Likes

100 games would result in very slow adaptation to changes in player strength, since many players don't play hundreds of games in a few months.

Right now it looks like the next iteration will either be a one game window or a fixed window with a width of days or weeks.

8 Likes

I've played like 250 games total. It probably took me a long time to even get to 100. I think there was a while where I was mostly playing correspondence.

2 Likes

This makes sense if you have a function that is expected to have a general pattern of direction but is “bumpy”, as is the assumption with many weights-updating systems of NN’s.

However, Glicko is not formulated on such an assumption. It uses a system of Bayesian inference (albeit slightly simplified by the assumption of a normal distribution with mean Rating and standard deviation Ratings Deviation) to track the most likely estimate of someone's rating, as well as the current most likely deviation. On top of this, it assumes ratings move with Brownian motion, with variance over time of volatility².

Now why does Brownian motion matter? Because the formulation of Brownian motion includes the idea that it doesn't matter what direction it was just moving previously: it has an equal chance of continuing in that direction as of suddenly switching direction (in one spatial dimension). As such, if your past 5 games were already used to find the most likely estimate before this game, there is no "momentum" calculation based on those games: the only relevant data in the update is the new information.
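The memorylessness claim is easy to check numerically with a simple symmetric random walk standing in for Brownian motion (my own illustration, unrelated to any rating code): the direction of the next step is independent of the direction of the previous one.

```python
import random

random.seed(0)
steps = [random.choice([-1, 1]) for _ in range(100_000)]

# among steps that follow an "up" step, count how many are also "up"
ups_after_up = sum(1 for a, b in zip(steps, steps[1:]) if a == 1 and b == 1)
after_up = sum(1 for a in steps[:-1] if a == 1)

ratio = ups_after_up / after_up
print(ratio)  # close to 0.5: no momentum in the walk itself
```

If ratings truly follow such a process, past direction carries no extra information, which is why the update should rest only on the new result.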

Of course, whether this assumption is correct is another question, and one could build a theory that models ratings with "momentum" in mind. I do believe, however, that the fact that no such model has produced a rating system as popular as the Brownian-motion models (including Glicko, Whole-History Rating, Edo, and TrueSkill) is an indicator (albeit not proof) that the assumption of Brownian motion is more accurate than the reverse.

As described by Glickman, yes, but my understanding of the "sliding window" is that each of the last 15 games (so long as they are within the last 90 days) counts toward a rating period that updates from the last rating period, despite 14 of those games having already been counted in the last rating period. This is where the "15 times" comes from.

So a lot of this started from Glickman's assertion that the system "works best" with rating periods averaging 10-15 games. This is not to say that the last 100 games don't count; it's just that recent games matter more. The games before the last 15 still "pushed" your rating in a direction, and the effects of that push become less and less notable over time (games from a year ago should reasonably matter much less than games today). Rating periods exist mostly to track "macro-scale" parameters with some foundation in time (chiefly the connection between rating periods and volatility). As such, there is a trade-off between large and small rating periods. Glickman (in his paper on Glicko-1, section 3.2) says:

The choice of the length of a rating period involves a variance-bias trade-off. For short rating periods, few data may be available to estimate players' strengths, and the analytic approximations used in the algorithm in Section 3.3 may not be dependable. Conversely, if long rating periods are used, a player's ability may have changed substantially over a rating period, but this would not be detectable. The best compromise seems to be rating periods that are as short as possible, but where enough data are available to have some indication of players' strengths, perhaps at least 5-10 games per player on average.

5-10 games being his estimation of Glicko-1's best parameters. Surprisingly, a similar paragraph does not appear (as far as I can find) in his paper on Glicko-2, but he seems to come to a similar conclusion in his description of the system, without derivation:

The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period. The length of time for a rating period is at the discretion of the administrator.

6 Likes

The 10-15 games are mentioned in his implementation example http://www.glicko.net/glicko/glicko2.pdf

The sliding window used by OGS takes the rating from 15 games ago as the base rating; therefore no game counts multiple times in the rating calculation. Basically, the sliding window creates 15 correlated ratings, all calculated as described by Glickman and used in rotation.

5 Likes

I guess that makes more sense than what I understood from the descriptions. It still removes the basis in time that rating periods are intended to have, but I've already voiced that problem on multiple occasions, so that isn't new.

It sounds like it creates a bunch of unnecessary variance in the representation, though, since the placement of games into rating periods in Glicko is meant to give them chronology, and rotating among 15 ratings, each with shifted rating periods, is a bit like running Elo with games in a mildly shuffled order (although the Elo example would be a tad more error-prone). At least now I understand why such an implementation would seem to make sense, although I'm not sure why it would be expected to make things less volatile.

2 Likes

Please, would someone explain in the simplest way possible how OGS Glicko-2 and the 15-game slide work? Mekriff explained it to me before, but now that it turns out even that was wrong, I can safely say I don't know a single active user on OGS who understands how it works. How do games affect the rank change only once? When I play a game and my rank changes, what affects that change? My last game? My last 15 games? Only the game I played 15 games ago?

I’d be glad if someone could explain this with as little math knowledge assumed as possible. (basically translate what flovo meant above, I suppose)

(My previous understanding, per mekriff's explanation, was that each time you played a game and your rank changed, every one of your last 15 games had a 1/15 effect on that change - so I thought the system calculated the rank change as if you had played your last 15 games at once.)

2 Likes

Your last 15 games change your rating relative to your rating before these 15 games.

5 Likes

It's probably best to work through an example. Keep in mind this all assumes the games happen within a 90-day period.
Let's say you start with no games and have starting rating R0.

After your first game, with result X1, you form a rating period where the initial rating is R0 with only game X1, producing output rating R1 (and an output ratings deviation and volatility, but I'm focusing on rating).

Now that that's happened, you play a second game with result X2, and you form a rating period where the initial rating is R0 (not R1), with two games (X1 and X2), to get output rating R2.

Once again, you do it for game 3 with result X3: R0 is again the initial rating and X1, X2, and X3 are the games within the period, outputting R3.

This continues exactly like this through the 15th game, once again using R0 as the initial rating and X1 through X15 as the games to update with, outputting R15.

but at 16 is where the sliding window comes into effect:

Now it starts a rating period with initial rating R1 (the one that resulted from only one game in the system), across X2 through X16 (notice X1 has dropped out of the window), to output R16.

In case it's still not clear exactly what's going on, let's go on to game 17, which starts at initial rating R2 (the rating from 15 games ago), across results X3 through X17 (once again, note X2 has also dropped out of the window), to output rating R17.
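The indexing in this example can be sketched in a few lines (my own illustration; `window_inputs` is a made-up name, and the actual Glicko-2 math that consumes these inputs is omitted):

```python
WINDOW = 15

def window_inputs(n):
    """For game n (1-indexed), return (index of the base rating R_i,
    list of game indices X_j fed into the rating period)."""
    start = max(0, n - WINDOW)
    return start, list(range(start + 1, n + 1))

assert window_inputs(3) == (0, [1, 2, 3])                 # base R0, games X1..X3
assert window_inputs(16) == (1, list(range(2, 17)))       # base R1, games X2..X16
assert window_inputs(17) == (2, list(range(3, 18)))       # base R2, games X3..X17
```

So each recomputation takes the rating from up to 15 games ago as its base and replays the most recent (at most) 15 games on top of it.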

4 Likes

So, does that mean the game that drops out of the 15-game window has the most effect on the rank change? Until then it only had a 1/15 effect, but when it becomes part of the base rating, the remaining 14/15 gets added all at once? Does that mean the rank change is affected half by the last 15 games you played and half by the 16th-from-last game that drops out?

For example, if I am a 5kyu and my last 15 games are wins against 5kyus and my 16th game from last is a defeat against a 5kyu; after my last game, my rank will not change?

1 Like

I don't think this is quite the right way of looking at it. The very first few games definitely had a lot more effect on the rating than the rest will, but after 15 games they become similar in worth, although it can be unpredictable. As flovo said above, this gives you 15 ratings used in rotation (assuming all games were within the last 90-day period).

Not necessarily. After that most recent defeat, you'll see a rating period where you had won against a 5k versus a rating period where you had 14 wins and 1 loss against 5ks; before that defeat, you only have the rating period where you had won 15 straight games against 5ks. I'm almost certain the 15 straight wins will yield a higher rating than the 14 wins and 1 loss preceded by a win, but if the 15-straight-wins period had a lower starting point than the 14-wins-1-loss period, it's hard to tell.

2 Likes