Rating system issues

anoek · June 30, 2020, 7:42pm

Thanks to @flovo we’ve discovered a couple of issues with the rating system. First, in the production system I royally messed up and forgot to account for handicaps. Big whoops. So, to everyone who was noting that ranks were feeling really strong: yeah, that’s why. Even if you yourself don’t play handicap games, your peers who do are going to get pulled up or down by it, which is going to tug your rating in the same direction. @S_Alexander even pulled some stats showing that a lot of dans had lost rank; this is the cause. “What about those pretty graphs I made?” you say. Well, I broke the cardinal NASA rule: test what you fly and fly what you test. I had my test code that I would run all of the data through for quick analysis, and then the production system, which ties into the database and all of that, and when I ported things over I somehow missed the handicap handling. My spot checks seemed a-ok, but I didn’t do a proper test to make sure the data was aligning in all cases.

The second failure he found was a long-standing issue that kept our deviation from converging as much as it should, which increased rating volatility. So, when you play a few games and your rating changes too much, this is why. It still bounced around the right rating, so it still functioned as an adequate metric for matching, but it was too volatile.

As of a few hours ago the handicap issue has been fixed going forward; however, the fix has not been applied retroactively yet. I’m not going to alter the deviation issue quite yet because I want to fully understand how that affects things before moving forward. We will be repairing the ratings retroactively once again, but it’ll be a couple of weeks.

There’s the other user experience issue of the sliding window system allowing for a drop in your rating even after a win. This seems to happen somewhat frequently and is both confusing and demoralizing, so I/we are going to go back to the drawing board a bit to come up with an alternative system that is more intuitive in that your rating should always go up on a win and down on a loss, even if only by a little.

The plan now is to separate out the ratings code into an open source module that I’ll be putting up on GitHub in the next day or two (I’ll reply with a link to it in this thread for those that are interested). There we’ll work on the next iteration of the rating system in the open; anyone who wants to help us figure out the best way to do things, or play spot the bug, is welcome to help out. It’ll have a full dump of the production data so we can analyze how different approaches perform against the approximately 12M rated games we have. Once we have a solid strategy, we’ll validate the implementation strongly and then it’ll be exactly what we use in production.

Sorry for the churn, and the yet-another-rating-blunder.

DVbS78rkR7NVe · June 30, 2020, 8:01pm

Whichever god set up the correspondence corruption, they must be laughing right now.

Jhyn · June 30, 2020, 8:23pm

OGS: the only app that you beg for a good rating.

(yes, I know OGS is not an app. Let me have my fun.)

Gia · June 30, 2020, 8:36pm

Yeah, the one time my rating gains 100 points out of nowhere, the system is corrupt. Thanks @flovo I guess.

Clossius1 · June 30, 2020, 9:39pm

So if we are talking about what ranking system should be used, I’d like to mention my favorite. FoxWeiqi’s rank system!

Now I don’t just say that because I’m 5D there. That is not as important. What I like about it is the simplicity of it.

[Image: fox rank capture 1]

In the above picture I can see exactly how many games I need to rank up and how many to rank down. It also allows for double rank ups if you do well.

In addition to this, you also start from 0 games when you rank up, and then it only uses your last 20 games to determine your ability to rank up. This means I will at least get to play 20 games at my new rank before ranking down or up again. This prevents rapid changes.

Note: it is 20 games for dans, fewer for kyus.

Another thing I enjoy about this system is that the number of games it takes to rank up changes per rank. In other words, the win/loss ratio required to rank up or down after 20 games depends on your rank.

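| Rank | Games required | 1 rank up | 2 rank up | 1 rank down | 2 rank down |
|---|---|---|---|---|---|
| 18k | 10 | 6 | 8 | - | - |
| 17k | 10 | 6 | 8 | 7 | - |
| 16k | 10 | 6 | 8 | 7 | 9 |
| 13k-15k | 12 | 7 | 10 | 8 | 10 |
| 10k-12k | 14 | 8 | 12 | 10 | 12 |
| 6k-9k | 16 | 10 | 14 | 11 | 14 |
| 3k-5k | 18 | 11 | 15 | 12 | 16 |
| 2k-2d | 19 | 12 | 16 | 13 | 17 |
| 3d-4d | 20 | 14 | 18 | 13 | 17 |
| 5d-7d | 20 | 15 | 20 | 13 | 17 |
| 8d | 20 | 15 | - | 13 | 17 |
| 9d | 20 | 18 | - | 13 | 17 |
| 10d | 20 | 18 | - | 13 | - |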
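
To make the table concrete, here is a minimal sketch of how such thresholds could be applied. It is not Fox's actual code. It assumes that the "rank up" columns are wins needed and the "rank down" columns are losses needed within the window, and that nothing is decided until the full window has been played since the last rank change. The names `Band`, `BANDS`, and `rank_change` are made up for illustration, and only a few rows of the table are included.

```python
from typing import Dict, List, NamedTuple, Optional


class Band(NamedTuple):
    games: int            # window size ("Games required")
    up1: Optional[int]    # wins for a single promotion (assumed meaning)
    up2: Optional[int]    # wins for a double promotion (assumed meaning)
    down1: Optional[int]  # losses for a single demotion (assumed meaning)
    down2: Optional[int]  # losses for a double demotion (assumed meaning)


# A few rows copied from the table; None stands for "-".
BANDS: Dict[str, Band] = {
    "18k": Band(10, 6, 8, None, None),
    "3d-4d": Band(20, 14, 18, 13, 17),
    "8d": Band(20, 15, None, 13, 17),
}


def rank_change(band: Band, results: List[bool]) -> int:
    """Return +2, +1, 0, -1 or -2 ranks.

    `results` holds the games played since the last rank change,
    oldest first, True for a win.
    """
    if len(results) < band.games:
        return 0  # assumption: no change until the window is full
    recent = results[-band.games:]
    wins = sum(recent)
    losses = len(recent) - wins
    if band.up2 is not None and wins >= band.up2:
        return 2
    if band.up1 is not None and wins >= band.up1:
        return 1
    if band.down2 is not None and losses >= band.down2:
        return -2
    if band.down1 is not None and losses >= band.down1:
        return -1
    return 0
```

For example, under these assumptions a 3d with 14 wins in their last 20 games would move up one rank, while 18 wins would move them up two.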

So, as you can see above, the number of games required to rank up changes as you get stronger. This is rewarding for weaker players, who can improve quickly.

Additionally, the stronger you get the tougher it is to achieve the next rank which makes having a dan rank worth something.

Balance-wise, it is much easier to adjust the win/loss ratio required at each rank until you feel that the server is at a good average level.

So let’s take me for example. I’m 4D AGA and 5D Fox, but maybe only 1d or 2d EGF/KGS (I dunno, I don’t play there).

Depending on where you would want my level to end up, you could adjust the win/loss ratio required and see how we do against each other.

The downside to this is that bots can be abused, depending on how they play. My opinion on this is that bots should only count towards your rank for new accounts, not established ones. But that’s a topic for another day.

In conclusion, the things I like about this system are…
1: Easy to understand for the user.
2: Rewards players for improving quickly.
3: Balances players by win/loss ratios per level instead of server wide.
4: Easy to adjust over time and balance.

Hope this helps!

anoek · June 30, 2020, 10:01pm

I’d be curious how that system performs for handicap computation.

I personally will be focusing on the same glicko2 foundation we’ve been running with. But if someone wants to take some time and run the games through another rating/ranking system altogether so we can benchmark other systems, and there’s something that works demonstrably better, then I’m all for it. It does have to perform well for matchmaking and handicaps, though.

Clossius1 · June 30, 2020, 10:03pm

One issue I have with handicap is the balance between ranks. At high dan level, the rank difference is roughly 6 points, or half a stone, so reverse komi works better.

The problem is that the stronger you are, the more a handicap stone is worth.

meili_yinhua · June 30, 2020, 10:07pm

The key thing with “rating periods”, as per Glickman’s paper, is that they are supposed to be periods within which a player’s rating is not expected to change much, like a tournament day. As such, I’d likely run a sort of “pure” Glicko approach where the rating periods are discretely defined on a certain time span (and maybe separate live and corr because of this). When displaying the rating to the user, it calculates as if the rating period has finished, but doesn’t “commit” the update until the period has actually finished, and game calculations are done with the prior rating and RD. It seems to me to be the most natural way to implement it, especially since that’s how you get time-based RD/volatility updates, but I guess lichess has 1-game rating periods, so it’s probably worth checking out how they run things.
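
One way to picture the display-versus-commit idea is the sketch below. It uses the simpler Glicko-1 update formulas rather than the Glicko-2 system OGS runs, and it leaves out the RD growth that normally happens when a player sits out a period. The `Player` class and its `record`, `displayed`, and `close_period` methods are made-up names for illustration, not anything from the OGS code.

```python
import math
from typing import List, Tuple

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd: float) -> float:
    """Discount factor for an opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected(r: float, r_j: float, rd_j: float) -> float:
    """Expected score against an opponent rated r_j with deviation rd_j."""
    return 1 / (1 + 10 ** (-g(rd_j) * (r - r_j) / 400))


def glicko1_update(r: float, rd: float,
                   games: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """One rating-period update; games are (opp_rating, opp_rd, score)."""
    if not games:
        return r, rd  # simplification: no RD inflation for idle periods
    d2_inv = Q**2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in games
    )
    denom = 1 / rd**2 + d2_inv
    delta = sum(g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in games)
    return r + (Q / denom) * delta, math.sqrt(1 / denom)


class Player:
    def __init__(self, rating: float = 1500.0, rd: float = 350.0) -> None:
        self.rating = rating  # committed at the end of the last period
        self.rd = rd
        self.pending: List[Tuple[float, float, float]] = []

    def record(self, opp_rating: float, opp_rd: float, score: float) -> None:
        # Opponent values should be their committed (prior) rating and RD.
        self.pending.append((opp_rating, opp_rd, score))

    def displayed(self) -> Tuple[float, float]:
        """What the user sees: as if the period ended right now."""
        return glicko1_update(self.rating, self.rd, self.pending)

    def close_period(self) -> None:
        """Commit the pending games and start a fresh period."""
        self.rating, self.rd = self.displayed()
        self.pending.clear()
```

With the worked example from Glickman's Glicko paper (a 1500/200 player who beats a 1400/30 opponent and loses to 1550/100 and 1700/300 opponents), `displayed()` should come out at roughly 1464 with an RD of about 151, while the committed `rating` stays at 1500 until `close_period()` is called.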

anoek · June 30, 2020, 10:11pm

I will be exploring that very thing @mekliff

DVbS78rkR7NVe · June 30, 2020, 10:20pm

Foxy has simplicity, but it has a cost too. The cost is the absolutely insane number of games you need to play to change anything.

15-20 games is way too much for anyone who doesn’t play go 24/7. As a result the ranks adjust very slowly (and are therefore less accurate).

If I recall correctly we switched to Glicko precisely because our ranks were slow to adjust and mods had to change them.

And it can be discouraging too, when you have one game deciding whether you go up or stay on the same level and you lose. You realize you’ll have to win like 15 games in a row to even get to this position again. And I assume people try to pick who they play because of that, because every opponent is worth the same, even if they have a 20-0 game record or 0-20.

Plus, let’s not forget: on Foxy you play only against your level (or ±1 without komi) for ranking. As far as I understand, there are no ranked games between different levels. They can do that because they have a very large player base. We don’t.

In conclusion, regarding foxy-like ranking on OGS: “No, God, please, no”. I love foxy but in a very special way.

I really believe a rating-based system, where you win and get some points or lose and lose some points, is the way to go.

Go_Board · July 1, 2020, 1:27am

@anoek, I just want to say that I continue to marvel at the work you do for all of us. I am profoundly amazed by the new tweaks and features you continue to add as time goes on. The AI addition was absolutely spectacular and now we are getting an improvement to the rating system which is even better.

I enjoy playing on other servers such as Fox and Tygem, but I have to say that I take pride in the OGS rating system. I agree with @S_Alexander that this Glicko-based system should be kept, as it gives OGS a uniquely sophisticated feature that other servers can’t brag about.

Every server has several features that make it unique and desirable and OGS is no different. I wish I could dive deeper into the idea of this rating system but the math is unfortunately too deep for me.

Thanks for everything as always! You have some real talent!

Tokumoto · July 1, 2020, 3:07am

In over 1000 years of Go history, OGS may be in a unique position to do what nobody, and no organization, has ever been able to do.

When the weakest and the strongest, and everybody in between, are placed in data proven strength-bands of kyus and dans spaced logically, and the logic and the data backing it up are disclosed for every statistician in the world to examine, OGS Rank will have a good chance to become THE gold standard in the entire Go world.

Please keep in mind that what Clossius said is very true: mid to high dans have less than 1 handi-stone separating each rank. It means that mid to high amateur dans can feel an undeniable difference in strength of much less than 1 handi-stone, just like kyus can feel the difference between 9k and 10k. The number of serious games high dans play in real life is much smaller than for kyus, and real-life ranking systems almost never had enough game results to accurately (statistically) assess which high dan is stronger than which.

So the fact that real-life high dans have (and prefer) less than 1 handi-stone separating each rank, “despite this difficulty”, may be strong proof that an ‘ideal’ Go ranking system should also have this characteristic.

ps. Hiring Mark Glickman may be too expensive. But luring him into playing Go on OGS may not be that difficult. Anybody taking his class at Harvard?

meili_yinhua · July 1, 2020, 3:13am

I would agree with Tokumoto on the “less than one stone”: even in tradition under the four schools system, dan ranks were usually 1/2 a stone (or less) apart. 1 rank apart meant sen-ai-sen, 2 ranks josen (no komi), 3 ranks sen-ni-sen (alternating no komi and two stones), and so on.

BHydden · July 1, 2020, 7:21am

Since strength and pairing is primarily controlled by rating points, and ranks are simply mapped onto them at the closest point to represent integer handicap stones… maybe we can fix the issues of the past by altering the distance of dan mappings such that, as far as it is possible, they are all integer handicap stones apart (or as close as we can make it)?

Valhall · July 1, 2020, 9:46am

I don’t know if this would work, it’s just an idea.

Normally consider the last 15 games.
If you win a game and your rating would go down, then consider the last 16 games (the last 15 plus the previous win, which must result in a higher or equal rating), and so on. If it would not go down (which will happen eventually if you are on a winning streak), reset to 15.
If you lose a game, reset to 15.
And the other way around for losses.
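
A rough sketch of one reading of this idea is below: try the base window first, and if the rating moved the "wrong" way for the result, keep every game from the previous window plus the new one. The `toy_rate` function is only a stand-in (it is not Glicko), and `rate_with_adaptive_window` and its parameters are made-up names. Whether the extended window really guarantees "up on a win" depends on the real rating function, so treat this as an illustration rather than a proof.

```python
from typing import Callable, List, Sequence, Tuple

Game = Tuple[float, bool]  # (opponent rating, True if we won)
BASE_WINDOW = 15


def toy_rate(games: Sequence[Game]) -> float:
    """Stand-in rating: mean of opponent rating +/- 400. NOT Glicko."""
    return sum(opp + (400 if won else -400) for opp, won in games) / len(games)


def rate_with_adaptive_window(
    history: List[Game],          # all games, oldest first, newest last
    prev_rating: float,           # rating before the newest game
    prev_window: int,             # window size used for prev_rating
    rate: Callable[[Sequence[Game]], float] = toy_rate,
    base: int = BASE_WINDOW,
) -> Tuple[float, int]:
    """Return (new_rating, window_used) after the newest game."""
    won = history[-1][1]
    window = min(base, len(history))
    rating = rate(history[-window:])
    # Down after a win, or up after a loss, counts as the wrong direction.
    wrong_way = rating < prev_rating if won else rating > prev_rating
    if wrong_way:
        # Keep all games from the previous window plus the new one.
        window = min(prev_window + 1, len(history))
        rating = rate(history[-window:])
    # Otherwise the window has implicitly been reset to the base size.
    return rating, window


# Small demo with a base window of 3 instead of 15.
history = [(2000.0, True), (1500.0, False), (1500.0, True)]
rating, window = toy_rate(history), 3
history.append((1500.0, True))  # a win that would push out the big early win
rating, window = rate_with_adaptive_window(history, rating, window, base=3)
print(rating, window)  # 1825.0 4
```

In the demo, the plain 3-game window would have dropped the rating from 1800 to about 1633 after a win, so the window grows to 4 and the rating goes up instead.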

Valhall · July 1, 2020, 9:49am

Don’t worry too much about breaking things, by the way. It will always happen unless you have a million unit tests, hundreds of acceptance testers and are extremely afraid of breaking things. OGS is still better than all other Go servers by far.

anoek · July 1, 2020, 3:15pm

It is certainly worth considering that that might very well be the desired arrangement.

I’ll certainly run that through the number grinder. Instinctively it seems like that would either create a lot of 9d’s or very few 1d’s, depending on where you anchor the rating system. It would also seem to imply that you’d have this progression effect where the last few SDK ranks would be comparatively brutal to get through, then once you hit 1d it’d be suddenly easier and quicker to rank up (or down), as the ranks would be tighter and perhaps more volatile because of that.

Kaworu_Nagisa · July 1, 2020, 4:08pm

I think what matters most is to have better matchmaking for even games. It doesn’t make sense to me to focus on ensuring that a 10k and a 1k can play “the most even ranked game possible”.

Clossius1 · July 1, 2020, 4:38pm

I’m happy you are considering this perspective. A few additional notes.

Don’t anchor it at 9D. According to a conversation I had with yoonyoung, getting past 6D should be HARD. Getting to 7D can be done in a year or 2 from 6D, but 1D-6D can be done in 6 months or so by a very studious student. Then 8D is apparently as hard to get as 30k to 7D, or roughly thereabouts. 9D is professional level. Granted, this is on Asian servers, but I want to stress that it’s better to have too many 1D than too many 9D. 7D+ is supposed to be grandmaster level for amateurs. The HARD ranks to get.

On another note, 2k/1k is actually supposed to have a barrier. It’s the last kyu barrier, where you have to make every move have positive, semi-efficient value. It’s true dans don’t play perfectly, but from my experience dans play a positive-value move 99% of the time. The difference is how efficient we are. This is not possible to compute, but the reason I mention it is that there may be a mathematical dilemma around promoting at 2k/1k/1d, and I want you to understand that it may not just be the system but also the mental barrier there. While I want them to be able to promote, if you make it too easy that barrier will be pushed to 1-3D or, worse, blend with a higher rank where they get crushed. This barrier is why 1D has some recognition and why it’s such a huge milestone.

Just food for thought.