Should never-used ratings categories follow the overall rating?

In the process of implementing Proposal: New users choose beginner/intermediate/advanced and drop ranked game restrictions, a question has come up around never-used ratings categories. Should these follow the parent ratings category (with a high deviation)?

(For more background, feel free to have a look at the discussion in a related draft pull request.)

The situation

Concretely, say there are three players, A, B, and C, and they each play 1000 (or 10000) games on OGS after creating accounts.

  • A plays an even distribution of live/blitz/correspondence and 9x9/13x13/19x19.
  • B plays only live 19x19 games.
  • C plays 1 experimental game of each category to start. Then the remaining games are live 19x19.

After these 1000 (or 10000) games:

  • Player A will have accurate ratings in all categories.
  • Players B and C will have accurate ratings for “live 19x19”, “live”, and “overall”, with ~1000 (or ~10000) games of history.
  • Players B and C will have a provisional rating in other categories, such as “correspondence 9x9”, with either 0 games (B) or 1 game (C) of history right at the beginning.

The question

The question at hand: what should player B’s provisional rating for “correspondence 9x9” games (a never-used ratings category) be?

  • Their original starting rank? (E.g., 1500 ±350 until a 9x9 correspondence game is played.)
  • Their current overall rating? (Overall rating ±350 until a 9x9 correspondence game is played.)

(Out of scope, for now: what should player C’s provisional rating for “correspondence 9x9” games be? Assume, for now, that it’ll be the status quo, whatever rating they earned with their single game in that category. This is presumably close to their original starting rank.)

Proposed change: system guesses the rating as accurately as it can

IMO, taking player B’s overall rating makes the most sense as the start point. The primary goal of the ratings system is to help players find fun/fair games, and this is the system’s best guess for their strength. Since it will come with a high deviation, it’ll still adjust quickly as they play a few games.

I think this works well regardless of how accurate player B’s starting rank was.
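In code, the proposal amounts to something like the following sketch. (The names and the function itself are illustrative, not the actual OGS/goratings implementation; 350 is the high provisional deviation mentioned above.)

```python
# Illustrative sketch only -- function and parameter names are
# hypothetical, not the real OGS/goratings API.
PROVISIONAL_DEVIATION = 350.0

def category_starting_point(category_games: int,
                            category_rating: float,
                            category_deviation: float,
                            overall_rating: float) -> tuple[float, float]:
    """Pick the (rating, deviation) a ratings category starts from."""
    if category_games == 0:
        # Player B's case: a never-used category. Seed it from the
        # overall rating, with a high deviation so it adjusts quickly.
        return overall_rating, PROVISIONAL_DEVIATION
    # Player C's case (status quo): keep whatever the category earned.
    return category_rating, category_deviation
```

So a player B sitting at 1850 overall would start “correspondence 9x9” at 1850 ± 350 instead of 1500 ± 350.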

Concerns with making a change here

@anoek brought up three concerns, and rather than discuss it just between the two of us, he suggested we open it up to the forum.

I’ve quoted his concerns and included my own responses (but keep in mind I’m writing this so it’ll have my bias…).

1. The difference between player B and player C is “not fair”.

Specifically, if you have two people that are about the same rank, and one plays a bunch of games with only one setting, then it lifts all of their ratings, whereas if the other has played even just one game in some other category, now they’re effectively hampered and “have to play more” to get to the same place as the first player. This is a weird incentive that seems like we should avoid; we want people to play whatever they want to play whenever they want to play.

I agree this incentive would be unfortunate. I’m (perhaps naively) optimistic that few players would actively shun game categories to “save” their ratings for later, but it’s possible that it could distort things somewhat.

IMO, we should get accurate ratings for player B if we know how, even if we don’t immediately have a solution to improve the accuracy of player C’s ratings. IOW, the benefit here outweighs the risk.

2. Ranking up is part of the fun.

The act of ranking up is an enjoyable and rewarding part of the experience; shortcutting that process isn’t necessarily desirable.

Agreed, but I don’t think this would in practice cut that process short.

IMO, when you play a game in a never-before-used category, you’re initially grinding through the “what’s my strength?” process. You want this grind to be as short as possible so that you can get to the fun part, which comes when you’re getting fair games (not when you’re beating up beginners or losing to experts).

Note that player B’s starting rank isn’t necessarily weaker than their overall rank (the grind might be “up” or “down”). E.g., the notional starting rank might be 6kyu (joined before beginner/intermediate/advanced split) or 2kyu (clicked “advanced” because they’re better than their friends—hopefully a rare case, but probably there will be a few). Then they play their 1000 (or 10000) games, settling in at 18kyu (or whatever).

3. The ratings are split into categories for a reason.

People have legitimately different strengths for different settings, sometimes by multiple stones. It’s probably true that starting them off in their fallback rank is closer to their true strength though, so maybe this isn’t really a valid concern.

I agree with both parts. IMO, there’s nothing to worry about there, because we’re just looking for a best guess.

Questions for the forum!

What do others think? Any other concerns?


I’m confused by this proposal. Are the ratings categories used for matchmaking? Or are you proposing that?


Whether partial ratings are used or not for matchmaking, I agree that for player B, (overall rating) ± 350 is a better guess than the standard 1150 ± 350.

On the other hand if something is done for player B, I feel that at least in the future, something should be done about player C.


I’m wondering why 1,000 games was chosen as the illustration (let alone 10,000). Loads of people, perhaps the majority on OGS, do not have that many. Surely 1,000 isn’t necessary for statistical accuracy.


Perhaps the total number of games is irrelevant. I’d trust a rating based on 20 games played last month more than a rating based on 10000 games played 10 years ago. A better algorithm would be something like: to calculate the partial rank, start from the overall rank, and adjust it using only games played in the last 365 days.
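One way to sketch that recency idea (a toy Elo-style update for illustration, not OGS’s actual Glicko-2; the names and the K-factor are made up):

```python
from datetime import datetime, timedelta

K = 24.0  # toy K-factor for illustration, not an OGS constant

def recent_category_rating(overall_rating, games, now):
    """Seed the category rating from the overall rating, then replay
    only this category's games from the last 365 days.

    games: iterable of (played_at, opponent_rating, won) tuples.
    """
    cutoff = now - timedelta(days=365)
    rating = overall_rating
    for played_at, opponent_rating, won in sorted(games):
        if played_at < cutoff:
            continue  # stale games don't move the category rating
        expected = 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))
        rating += K * ((1.0 if won else 0.0) - expected)
    return rating
```

With only stale games in the category, the player simply keeps their overall rating; recent games pull the category rating away from that seed.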


I think less, not more, integration between nominally different ratings is desirable. Lichess does this well: no overall rating.


There is a big difference between chess and go: size of the board. Ranks thus take longer to adjust in go.

So, games take more ply, so players can’t play as many, so ranks get data more slowly, and thus adjust more slowly? Are you trying to imply a certain conclusion from that observation? Or just noting it for clarity?


What I’m saying is that if the initial rating on Lichess is 1500 Elo for everybody, this is not a big problem since the rating quickly becomes accurate. Whereas having a wrong initial rating on a go server creates a more unpleasant user experience. Therefore go servers need to make more effort guessing the strength of a player accurately.


Oh, interesting, I was just assuming the categories were (already) used for matchmaking. Checking the code, they’re not!

Only the overall rating actually is used for matchmaking as far as I can tell. I’m not proposing a change to that.

In which case, the stakes here are super low, since only the overall rating actually does anything.

(Probably it would be nice to make these categories correct enough that they could be used for matchmaking, and then, maybe it would be smart to do so, but I’m not proposing that here.)


That’s exactly why I chose such a large number, so that it was clear that the ratings were well-established :). 1000 seemed “big enough” to be convincing.


Yeah, interesting… probably time-since-last-played is a more relevant indicator of the staleness of a rating category.

From my understanding of @BHydden’s post a while back, those categories were once separate:

@BHydden also once mentioned that anoek tested having a combined rating for all sizes and time settings and found it beneficial:

@Groin also mentioned something similar:


In addition to the 2021 thread mentioned in those quotations, Anoek published a summary analysis of using overall ranks to predict different speed and sizes in this rating update thread.

I don’t have the statistical skills to deeply analyze any of it, but I feel there should be some matchmaking consideration for people whose skill on one board size or speed deviates far enough from their overall rating. Perhaps if a player’s overall and category rank are statistically different, and the category rank meets some standard-deviation criterion (is it firm enough?), the matchmaker could use some formula to blend the ranks.
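For what it’s worth, a standard way to do that blend is a precision-weighted average. A minimal sketch, with a made-up “firm enough” threshold and hypothetical names (this is not how OGS matchmaking currently works):

```python
def blended_matchmaking_rating(overall, overall_dev,
                               category, category_dev,
                               max_usable_dev=160.0):
    """Blend overall and category ratings, weighting each by its
    inverse variance (a precision-weighted average)."""
    if category_dev > max_usable_dev:
        # Category rating isn't firm enough: fall back to overall.
        return overall
    w_cat = 1.0 / category_dev ** 2
    w_all = 1.0 / overall_dev ** 2
    return (w_cat * category + w_all * overall) / (w_cat + w_all)
```

With equal deviations this lands halfway between the two ratings; as the category deviation shrinks, the blend leans more on the category rating.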