Proposal: segment ranking system by board size

I hope this hasn’t been discussed to death elsewhere, but I have some concrete evidence to contribute.

Take a look at my profile, and filter by “rated” and “19x19”. I haven’t played a ton of games lately in this category, but you’ll see the game record of a solid 1d player: a balanced performance against other 1d’s, and generally losing against stronger players, with a few wins here and there. Indeed, as of yesterday my rank was about 1.7d.

Today, however, my rank is 3d. The reason is that I blitzed out a bunch of 9x9 games for the WSC 2024 event, mostly handicap games against kyu players, and I won a lot of these games. (Though my record against other dan players today was 2-2.)

Reasonable people can disagree about whether beating a bunch of kyus in handicap games is evidence that my 9x9 playing strength is around 3d. But I don’t think it’s possible to argue that my 19x19 playing strength is 3d on the basis of that evidence. I certainly don’t mind that I’m going to get paired with some stronger players when I get back to 19x19, but it’s not good for the rating system overall.

The obvious fix is to formally segment the rating system by board size. Some version of this has already been done, in the sense that I can see my estimated playing strength for 9x9, 13x13, and 19x19 in my profile. But it seems that only the aggregated rating is used when displaying my rating to other users, and when setting handicap / komi.
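To make concrete what "formally segmenting" could look like, here is a minimal sketch (the names and structure are my own invention, not OGS code) of keeping fully independent ratings per board size and reading only the matching one for pairing, handicap, and display:

```python
# Hypothetical sketch of per-board-size ratings; not OGS's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Rating:
    value: float = 1500.0      # Glicko-style rating value
    deviation: float = 350.0   # uncertainty (rating deviation)

@dataclass
class Player:
    # One independent rating per board size; no aggregated rating at all.
    ratings: dict = field(
        default_factory=lambda: {9: Rating(), 13: Rating(), 19: Rating()}
    )

    def rating_for(self, board_size: int) -> Rating:
        # Pairing, handicap/komi, and the displayed rank would all read
        # this, never a combined value.
        return self.ratings[board_size]

p = Player()
p.ratings[9].value = 1800.0    # strong on 9x9...
p.ratings[19].value = 1550.0   # ...but much closer to default on 19x19
print(p.rating_for(19).value)  # 19x19 pairing ignores the 9x9 streak
```

The point of the sketch is just that a 9x9 winning streak would never touch the 19x19 number used for matchmaking.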


It has been. It’s something that comes up again and again.

Those playing strengths for different sizes are informative only. No decision is linked to them; everything (pairing, calculating a new rating, etc.) uses your global rating.

This should be the most recent major thread on OGS ratings:

That announcement has 670 replies and is linked from 39 other threads (see the bottom of the OP).

In another, older announcement on ranking, if I remember correctly, an explanation was given, with supporting simulations, that handling different sizes together does not introduce enough distortion to justify the stratification you suggest. Maybe someone else can provide a direct link.

Edit: OK, I found what I think is the most relevant topic.

There are some interesting Q&As later in the thread.


I’m glad to hear that the issue has been looked into. But after almost 100 9x9 games in the past 2 days, my rating is 3.7d at the time of this writing. My record against dan players over this period is 4 wins and 8 losses. If my performance continues along this trajectory, next time I play a 1d in a 19x19 game I will be giving them 3 handicap stones, instead of playing them even as I probably should.
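For context on the "3 handicap stones" figure: as I understand the common convention (this is my reading, not a quote of OGS internals), automatic handicap is roughly the rank difference rounded to the nearest whole stone:

```python
# Rough convention: handicap stones ~ difference in rank, rounded.
# A 3.7d paired against a 1.0d differs by 2.7 ranks -> about 3 stones.
def auto_handicap(stronger_rank: float, weaker_rank: float) -> int:
    """Approximate automatic handicap from a rank difference (hypothetical helper)."""
    diff = stronger_rank - weaker_rank
    return max(0, round(diff))

print(auto_handicap(3.7, 1.0))  # the scenario described above: 3 stones
print(auto_handicap(1.7, 1.0))  # yesterday's rating: roughly an even game
```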

So my rating is objectively not correct for 19x19, and the gap between my rating and what it should be is growing. I’ll leave it to others to decide what, if anything, to do with this information.


I’ve mentioned it before, but I’ll repeat it here: I would like the overall rank not to exist, and for every place, front end and back end, that currently uses it to switch to using one or more of the separate ranks. Lichess is, in my opinion, a great example of how to implement this.

The party line from OGS has always been, as far as I can recall, that statistically the combined rank predicts game results better than the individual ranks do. I have no reason to doubt this; indeed, even if I did not trust anoek not to fabricate data, I would still suspect it was true (and as it is, my trust in anoek is yet another reason to believe it).

That said, I think that predicting results is only one (albeit important, even critically so) raison d’être of a rating system. Another relevant one is to feel intuitive: gain points for wins, lose them for losses, move toward the opponent’s rating for draws. Yet another is to stay out of the way of how players want to play the game: many people are better or worse relative to the field at faster time controls, and others are better or worse relative to the field at smaller (or larger) board sizes.

Glicko-2 does a great job of feeling intuitive; I don’t think this is an issue at OGS (though it bears noting that WHR would almost certainly predict winners better than Glicko-2, at the cost of intuitiveness).

However, the overall rating, especially emphasized as it is as the player’s “real” rating, very much gets in the way of playing how one wants. I have two accounts at Lichess: one for playing blindfold (significantly, there is no “blindfold” rating on Lichess), and the other for everything else. I have half a dozen OGS accounts: one for 13x13 (and 9x9), another for blindfold, another for blitz, at least one or two more, and the last for everything else. The difference? OGS doesn’t differentiate these ratings, so I fixed the problem for myself with different accounts; Lichess differentiates most things, so I took advantage and just play everything that has a category (so no blindfold) on my main account.

PS: you’ll notice some recent 9x9 games on my main account, but note that they’re unrated. I prefer to play rated, but I also prefer to have AI post-game review and site supporter status on my main account, so I compromised and played 9x9 on my main account, but unrated. When I want to play rated 9x9, I log into my 13x13 (and 9x9) account. If OGS eliminated the overall rating, I would gladly play 9x9 rated on my main account, so that’s yet another way the overall rating gets in the way of playing how one wants.


The party line from OGS has always been, as far as I can recall, that statistically the combined rank predicts game results better than the individual ranks do.

I saw this claim in one of the threads that Groin linked to, and I was immediately skeptical. I haven’t dug into the analysis, but I can think of several ways this could look true without actually being true.

For example: my ratings are currently 3.7d on 9x9, 2.7d on 13x13, and 1.6d on 19x19, with a combined rank of 3.9d. Since the combined rank is higher than any of the individual ranks, I’m guessing it has some sort of “momentum” factor built in which assigns higher weight to recent games. I would guess incorporating momentum improves predictive power quite a bit, because you’ll get players who take a hiatus from OGS to go off and play on Fox for a while, and they’re stronger when they come back. Hypothesis: the board-specific ratings would be more accurate than the combined rating if they also had a momentum correction.

Additionally, I’m guessing that most players are biased towards one specific board size, and this would work in favor of the combined score. For example, if player X has played 300 19x19 games but only 5 9x9 games, then the combined score might be a better predictor for X’s next 9x9 game than the 9x9 score, simply because 5 games is not much data from which to score the player. Hypothesis: the board-specific rating is more accurate than the combined rating if you filter down to players who have a sufficiently large number of games on that board size.
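These hypotheses are testable in principle. Here is a toy sketch (with made-up games, not the OGS dataset or its actual methodology) of the kind of comparison involved: score the combined rating and the board-specific rating on the same games using log-loss, where lower means better predictions:

```python
import math

def win_prob(r_a: float, r_b: float) -> float:
    # Logistic (Elo-style) expected score; Glicko uses a similar curve.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def log_loss(predictions, outcomes):
    # Average negative log-likelihood of the observed results; lower is better.
    return -sum(o * math.log(p) + (1 - o) * math.log(1 - p)
                for p, o in zip(predictions, outcomes)) / len(outcomes)

# Made-up 9x9 games for a player whose combined rating (1900) overstates
# their 9x9 strength (1800): (combined, board-specific, opponent, result)
games = [
    (1900, 1800, 1750, 1),
    (1900, 1800, 1850, 0),
    (1900, 1800, 1800, 1),
    (1900, 1800, 1900, 0),
]
combined = [win_prob(c, opp) for c, _, opp, _ in games]
specific = [win_prob(s, opp) for _, s, opp, _ in games]
results = [r for *_, r in games]
# In this toy data the board-specific rating scores (slightly) better.
print(log_loss(combined, results), log_loss(specific, results))
```

The real question is which predictor wins on the actual game database, after the filtering described above; the sketch only shows what "better predicts" would mean operationally.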

PS: you’ll notice some recent 9x9 games on my main account, but note that they’re unrated. I prefer to play rated, but I also prefer to have AI post-game review and site supporter status on my main account, so I compromised and played 9x9 on my main account, but unrated.

I’m with you. The vast majority of my OGS games are unrated 13x13. I do this because I know I’m stronger relative to the competition on smaller boards (I managed to briefly get up to 4d from 1d in 100 9x9 games), but I want to play rated 19x19 games against equal competition. If they stratified the ratings by board size, I would instantly switch all of my 13x13 games to rated. I didn’t really want to bother with multiple accounts the way you did, but maybe that’s actually the way to go.


As I understand it, all the ratings, both overall and individual, use the exact same algorithm (none is derived from the others; each is calculated completely independently); the only difference is the data included in the calculation. Glicko-2 has both an uncertainty factor and a volatility factor that measures how consistent you are. There might be some emergent behavior that could be called momentum, but I expect what you are noticing is that the overall rank has more games to work with, and in your case that pushed the overall rank up a bit faster than your strongest individual rating. The uncertainty factor is indeed what makes one’s rating prone to quick correction after a hiatus, which makes intuitive sense.
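For the curious, the mechanism behind that quick correction after a hiatus is simple: in Glicko-2, the rating deviation is inflated by the volatility before each rating period is processed, so the deviation keeps growing while a player is idle, and a large deviation makes the next results move the rating faster. A minimal sketch of that pre-period step (values on Glicko-2's internal scale; constants here are illustrative):

```python
import math

def inflate_deviation(phi: float, sigma: float, idle_periods: int = 1) -> float:
    # Glicko-2 pre-period step: the deviation phi grows with the
    # volatility sigma; applied once per rating period, played or not.
    for _ in range(idle_periods):
        phi = math.sqrt(phi * phi + sigma * sigma)
    return phi

phi = 0.6     # a fairly settled deviation on the internal scale
sigma = 0.06  # an illustrative volatility
# After a long hiatus the deviation is noticeably larger, so the first
# games back swing the rating harder.
print(inflate_deviation(phi, sigma, idle_periods=1))
print(inflate_deviation(phi, sigma, idle_periods=50))
```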

I think the party line response would be that most players don’t have enough games across a wide variety of settings to fill out the individual ratings well (self-fulfilling prophecy, anyone?), and thus for most players the overall rating is still better. My response would just go back to my position about not solely valuing predicting winners as the raison d’être of a rating system.

Glad I’m not the only one who feels this way. I generally play rated when I can, because I do value my rating being a reflection of my strength, just not to the exclusion or marginalization of all else

It does bear noting that I believe multiple accounts are technically against the TOS, but I dedicate each one (besides my main account) to a specific thing and only use it for that, and I never ever play them against each other, and I haven’t had any issues with it.

EDIT: the premise underlying the above paragraph appears to be false, as seen in the next few posts of this topic

I don’t think so (at least as long as you don’t use them for cheating or rank manipulation).


Whether officially or unofficially, I suspect you’re right

Something I read here from a mod some time ago, but I guess I’m not going to search for it in the mess of our forum.

Having dedicated accounts per time setting or board size seems to be quite common, with AI analysis not being shared between them as a side problem.


Our mess of a forum has a search feature to help with searching :smiley:

Oh sure, I’m being lazy at times…

1 Like