I’ve only briefly looked at the ratings calculation code myself. I haven’t actually tried anything with the data.
I’m sure that if we have specific questions and there’s time, anoek will answer them. The forums are full of those sorts of Q&A threads about ratings and so on. I don’t think there’s any reason to stick to the overall rating except that it seems to be working well.
I think the idea is always to keep the go server working as well as it can.
I’d say, and I don’t know anything (just speculating), if it turns out the rating system is not perfect for 0.001% of players, it’s probably not going to be worth overhauling the whole system.
If it’s a significant fraction of players it’s a different story. Basically, you can also play in a way that makes a rating system predict your rating poorly by adding enough variance. You can play when you’re extremely tired, or when you’re drunk, or you can play so many games in a row that you’ve basically stopped thinking; you can blitz, or you can play live games so long that they basically become an endurance test.
If you’re playing in a volatile way, hopefully a decent rating system will just adapt and maybe do a decent job of predicting your mean rating, but reflect the uncertainty in terms of variance. I feel like Glicko-2 is supposed to do this to some extent.
I think if the rating system isn’t performing badly as a whole, small edge cases (assuming they really are small edge cases) probably don’t need to be corrected, really?
I presume the referenced test by anoek was comparing these 2 strategies:
Always use the overall rating to predict the outcome of games on all board sizes and speeds
Always use the appropriate size/speed-specific rating to predict the outcome of a game with those settings
And the result was that 1 was better than 2. I’m not surprised, if in many cases the specific ratings had high uncertainty due to few games. But strategy 2 is pretty dumb and a bit of a strawman, because there’s a more intelligent third strategy:
Use the specific rating if its uncertainty is similar to that of the overall rating, else use the overall rating. “Similar” is a parameter that could be adjusted, say no more than 130% of the overall uncertainty. So if the overall is +/- 0.7 and the specific is +/- 0.9, use the specific, but if the specific is +/- 2.0, use the overall. This way you only use the specific ratings where you have good reason to trust their accuracy.
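In code the selection rule is tiny; here’s a minimal sketch of strategy 3 (the `Rating` class and the 1.3 threshold are just placeholders I made up, not anything from the goratings repo):

```python
from dataclasses import dataclass

@dataclass
class Rating:
    value: float      # rating
    deviation: float  # rating deviation (uncertainty)

def rating_for_prediction(overall: Rating, specific: Rating,
                          max_relative_deviation: float = 1.3) -> Rating:
    """Use the size/speed-specific rating only when its deviation is within
    max_relative_deviation times the overall deviation; otherwise fall back
    to the overall rating."""
    if specific.deviation <= max_relative_deviation * overall.deviation:
        return specific
    return overall
```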
Someone can reach a clear 20k rank on 19x19, then play zero 19x19 games for some years while reaching 5k on 9x9; when they return to 19x19, the uncertainty of their 19x19 rating will not have increased.
Currently, uncertainty does not increase with time. That should be changed, or an additional detector for outdated ranks is needed.
Yes, there are still edge cases which could have not-so-great behaviour, but I expect my strategy 3 has better results in the vast majority of cases. I just randomly looked at a player in a top active game:
Use 3.8d not 3.5d in 19x19 live games. Don’t use 4.7k in 9x9 live games. Pretty obviously sensible.
Haven’t used this account to play in a long time, but just check out my account as an example. It is ranked 7.1k for 13x13 correspondence while it’s ranked 12.8k for live 19x19. The reason for that is also pretty easy to understand. I started playing Go for the first time in summer 2022 and this was my first account. I got used to 9x9 and 13x13 pretty fast because these boards felt smaller and more straightforward. When I started playing 19x19, it was not only that the new, larger board was difficult to get used to, but also that the rating algorithm kept giving me opponents that were way too strong, because it used my better rating from the other board sizes. This severely impacted my experience when playing 19x19 games. As you can see, I have 1 win and 8 losses for 19x19 live games, and the first thing I did after that was to create a new account to play 19x19. By now I have 11 different OGS accounts.
This could also be a self-fulfilling prophecy, though. Because of the way OGS handles the ratings, I keep creating new OGS accounts (11 by now), each of which has high uncertainty due to relatively few games. If OGS differentiated the category ratings, I would only use a single account.
FTR, my experience is that anoek is open to all changes to ratings. But, OGS needs to be very conservative about:
collecting data (via goratings) ahead of time, to be sure it will be an improvement
frequency of changing the live system, because each one is disruptive and expensive (since it involves recalculating the full history of ratings)
Agreed. I floated this with anoek a few months ago and he agrees it’s worth looking into. I haven’t yet had time to implement this in the goratings repo. If someone else is motivated and has the time, please loop me in to the investigations because I’ve already done some thinking. (I have the same username on GitHub.)
(By the way, the term Glicko-2 uses for uncertainty is “deviation”. This is the term of art used in the goratings repo.)
This is correct; there is currently no “time” element to increase uncertainty on OGS.
Glicko-2 is designed to be time-based, whereby games are evaluated together in a “period” (of some length of time), and ratings+deviation are recalculated at the end of each period. A period with no games would increase the deviation without changing the rating.
The implementation here evaluates each game in isolation, as if it’s the only game in a period (of variable length).
The goratings repo does have (behind a flag IIRC?) some degradation of deviation over time. I landed this a few months ago. It’s a start, but IMO not enough.
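For reference, the Glicko-2 paper’s treatment of an idle period is simple: rating and volatility stay put, and only the deviation grows. A small sketch of that step (the example numbers below are arbitrary):

```python
import math

# Glicko-2: if a player has no games in a rating period, rating and volatility
# are unchanged and only the deviation widens: phi' = sqrt(phi^2 + sigma^2).
# phi and sigma are on the internal Glicko-2 scale (deviation / 173.7178).
def deviation_after_idle_periods(phi: float, sigma: float, periods: int) -> float:
    for _ in range(periods):
        phi = math.sqrt(phi * phi + sigma * sigma)
    return phi

# Example: display-scale deviation 50, volatility 0.06, idle for 26 weekly periods.
phi = 50 / 173.7178
print(deviation_after_idle_periods(phi, 0.06, 26) * 173.7178)  # ~73: uncertainty grew
```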
A third issue, not discussed in this thread yet I think, is that accounts that play lots of games can see massive fluctuations in ratings over a short period.
For example, amybot-ddk, which is a bot account, has fluctuated between 7.6k and 24.1k in the currently visible rating history (last 5000 games, which is about a week for this bot).
A player whose strength isn’t changing and who plays 5000 games per week should have a fairly stable rating with a very low deviation. But the deviation never gets as small as one might expect.
There are probably a number of factors contributing to this volatility in ratings and surprisingly high deviation.
“Time” isn’t used in the OGS ratings calculations. This means that 10-20 game winning or losing streaks cause wild swings in ratings, even if over the course of a day the bot plays consistently.
Maybe amybot-ddk has a different strength for different board sizes.
If a player times out when playing a bot, the game is not rated. So “resign by abandonment” depresses bot ratings. And since bots don’t care either way, humans don’t consider it rude to abandon games to them.
It’s not just bots. I think ratings are a bit weirdly volatile on OGS. Anecdotally, it seems like the more games you play, the more your rating fluctuates. But in a properly functioning system, it should be the other way around (unless your playing strength is actually changing).
Over the next few weeks, I’m hoping to find time to write up a proposal / summary document of what I think should happen, why, and what data need to be collected to support the changes. But here are the high-level pieces of what should change (with some discussion of why, but no data!):
Add a time element.
E.g., a sliding window of one week.
But probably something more complex (e.g., sliding window of 1 day, but look back up to a month to try to get a minimum of 10 games; if looking back, “age” the starting rating to only 1 day old before running the update).
Make each rating category (mostly) independent. E.g., something like this (but need data to fine-tune):
If this rating category (e.g., live-9x9) has a game in the last month(?), just use it.
Else, build a blended rating from the parent rating categories (e.g., live and 9x9) by taking a weighted average and combining the parent deviations; if the blended rating ends up with a lower deviation, use that. (See the sketch after this list.)
Re-evaluate whether (and if “yes”, when) correspondence timeouts should be annulled.
The primary harm from NOT annulling correspondence timeouts is that the returning player needs to defeat “a lot of opponents” to get back to their true playing strength.
I think it’s possible that we don’t need this protection anymore. Glicko-2 allows ratings to change quickly (there won’t be too many trounced opponents). Also, only the correspondence rating will have tanked, and if they’re gone long enough, they’ll fall back to the blended rating anyway.
Stop annulling games when players abandon bot games. Give bots the win.
If players want to abandon bot games without it affecting their ratings, they should use “unranked”.
Else, we should assume the player abandoned because they felt they were losing.
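Here’s the sketch of the blending idea referenced above. Inverse-variance weighting is my assumption for “weighted average + combined deviations”; it treats the parents as independent (which they aren’t, really), so the actual weighting, like the “last month” recency check, is exactly what would need tuning against data:

```python
import math
from dataclasses import dataclass

@dataclass
class Rating:
    value: float
    deviation: float

def blend(parents: list[Rating]) -> Rating:
    """Inverse-variance weighted average of the parent category ratings
    (e.g., 'live' and '9x9'), with a combined deviation."""
    weights = [1.0 / (p.deviation ** 2) for p in parents]
    total = sum(weights)
    value = sum(w * p.value for w, p in zip(weights, parents)) / total
    return Rating(value, math.sqrt(1.0 / total))

def rating_to_use(category: Rating, has_recent_game: bool,
                  parents: list[Rating]) -> Rating:
    """If the specific category (e.g., live-9x9) has a recent game, use it;
    otherwise fall back to the blended parent rating, but only if the blend
    is actually more certain than the (possibly stale) category rating."""
    if has_recent_game:
        return category
    blended = blend(parents)
    return blended if blended.deviation < category.deviation else category
```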
Up until a few months ago, OGS ratings (v5) assumed that a handicap of N stones equated to an effective rank adjustment of N.
For 19x19, this was mostly correct, but it has the flaw you pointed out.
This was grossly incorrect for small boards.
Now, OGS ratings (still v5) is told directly what the effective rank adjustment should be (“handicap rank difference”), but it’s always an integer.
For 19x19, no change. It’s the number of stones. Still has the flaw you pointed out.
For small boards, this is the number in parentheses in the “Game Information” panel. E.g., a handicap 9x9 game might have “Handicap: 2 (Rank: 8)”; this means 2 handicap stones, but overall a handicap rank difference of 8.
This was landed as a bugfix to ratings v5 without recomputing historical ratings because it seemed worth doing quickly (to stop actively trashing ratings).
It’ll be fixed retroactively once ratings v6 lands.
goratings (v6, in progress) is smarter, and computes the effective rank adjustment from the combination of handicap stones and komi (and ruleset). (A rough sketch of that kind of computation follows this list.)
I.e., fixes the flaw you pointed out.
Allows the rating system to evaluate non-standard komi in general (so, with this in place, we could have a policy change to allow rated games with arbitrary komi).
Will come in on the next major ratings change (v6), whenever that happens.
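For flavor only, here is the general shape such a handicap/komi computation could take. Every constant below is a made-up placeholder (as is the choice of how to treat the first handicap stone); I don’t know the actual formula or values v6 uses:

```python
# Hypothetical sketch: convert (handicap stones, komi, board size) into a point
# advantage for Black, then into an effective rank difference. All constants are
# placeholders; picking the real values is what the v6 experiments are for.
FAIR_KOMI = {9: 5.5, 13: 6.5, 19: 6.5}            # assumed "fair" komi per size
POINTS_PER_RANK = {9: 3.0, 13: 7.0, 19: 12.0}     # assumed points per rank of strength
EXTRA_STONE_VALUE = {9: 6.0, 13: 14.0, 19: 24.0}  # assumed value of stones beyond the first

def effective_rank_difference(handicap_stones: int, komi: float, size: int) -> float:
    points = FAIR_KOMI[size] - komi               # reduced komi is itself a handicap
    if handicap_stones >= 2:                      # first stone just means Black plays first
        points += (handicap_stones - 1) * EXTRA_STONE_VALUE[size]
    return points / POINTS_PER_RANK[size]
```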
It certainly could be the case, and I believe it, that the overall rating might work better at predicting 9x9 and 13x13 games when you have an accurate 19x19 rank, as opposed to the other way around: when you have an accurate 9x9 or 13x13 rank but are inexperienced at 19x19.
It could be that this is a much more difficult transition; you could win a lot of local fights but still lose overall.
Yeah, honestly I thought we would have landed v6 by March, but I got busy in February and didn’t have time to continue the experiments. So “whenever that happens” is not well-defined at all.
I won’t really have time to design/run ratings experiments myself until September-ish (at least I hope September). That’s why I’m aiming to write up a document in the next few weeks.
Useful for me when I come back to it, if nothing changes between now and then, but also:
If someone else is keen and able to design/run experiments in goratings, they’ll have my thoughts as a starting point in the document, and I can probably find time to give feedback on their changes/experiments (even if that’s all I do) (also, anoek will have useful feedback, so such a person wouldn’t be working in isolation regardless)
I think the problem with bots is that they have weaknesses that human players just don’t have beyond a certain rank, so they’re probably an edge case that won’t be so useful to look at.
Essentially, imagine that a player around 7 or 10 kyu just randomly forgets that they need two eyes to live: they’re winning the game, but for no reason they fill in an eye instead of passing, or play a self-atari.
Imagine they now do that many many times a day, and that’s effectively how some of these bots behave.
I don’t know that it’s really comparable to how humans play go beyond a certain level.
Ratings are just “average strength”. The more games, the smaller the error. “Many many times per day” should mean that ratings are stable and well-understood, not volatile.
Concretely, say we use a time-based period of 1 day (or a 1 day sliding window). For amybot-ddk, that’s about 800 games per period. From that corpus, we should be able to compute an accurate (and quite precise!) rating for amybot-ddk’s average strength.
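As a rough illustration (this is an Elo-style “performance rating”, not the Glicko-2 update OGS actually runs): with ~800 results in one period, there is essentially only one rating consistent with the observed total score against those opponents.

```python
# Find the single rating whose total expected score against the period's
# opponents equals the observed score (Elo-style logistic model). Bisection
# works because the expected score is monotonic in the rating.
def expected_score(rating: float, opponent: float) -> float:
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))

def performance_rating(opponents: list[float], total_score: float) -> float:
    lo, hi = -1000.0, 4000.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(expected_score(mid, o) for o in opponents) < total_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```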
Do all changes really need to be retroactive? Something major like switching from Glicko to Glicko-2, sure, but if OGS said “henceforth, we will use board size/timing specific ranks if they have similar deviation to the overall rank for rating calculation” I don’t think users would be out with the pitchforks demanding that also applies to the 9x9 game I played in 2005. Feels like a self-imposed restriction that makes things harder than they need to be. Or is there some technical reason why the entire rating system needs to be replayable from year 0?
Indeed, I found it kind of sad when these rating updates retroactively destroyed (almost) the history of the ratings you had when you first started playing.
When I started, I certainly played like a 20-25 kyu, and the massive rating recomputations somehow make it appear as if you were 10 kyu when starting.
I think only the old rating survived in chat logs and things of that kind.
But whatever volatility in strength a player (bot or not) has, playing more frequent games gives the system more data and should result in a more stable rating. Indeed, Glicko-2 was designed that way.
But the implementation on OGS does not work like that, because it does not use a time-based period. Instead of the ratings graph becoming smoother when there are more games in the same amount of time (converging to a smooth curve), it just oscillates at a higher rate.
It’s about as extreme as it could be, because OGS looks at each game in isolation. Every ratings calculation is looking at a surprising data set, because the player either lost every single game in the period or won every single game in the “period” (not really a period, just a single game).
I think it depends on what you mean by “technical”. I don’t think there’s an architectural reason, just policy; but there are technical benefits to reproducibility. For example, this means it’s possible to run experiments on the historical dataset to validate a change, and trust that what you learned will apply to the “live” OGS system in practice.
EDIT: Yeah, I had something related happen. When I joined, I was about 4k, and entered my rank as such. I played a bunch of correspondence games against friends, most of whom were beginners. I played in a few correspondence tournaments, but not many. When I returned a decade later, I discovered that my initial games had been recalibrated to high dan. I believe this happened because all self-selections were dropped on the floor in the initial switch to Glicko-2, and everyone was just assigned “1500”.
Unfortunately, I think those self-selections (or the lack thereof) were completely lost. It’d be really nice to be able to use them as input (when doing experiments and/or when rebuilding). If I understand the old threads correctly, they were dropped because many players were intentionally reporting a much weaker than true rank.
On the plus side, the new OGS beginner/intermediate/advanced selections ARE being saved.