Are OGS rankings inflated, deflated or neither?

Conrad_Melville · December 29, 2022, 3:12pm

Whom are you addressing? Nothing you have written here has anything to do with anything I said.

Self-identifying as a beginner at registration (i.e., with some kind of special marking aside from a rank) sounds like a great idea. I don’t recall it ever being suggested before.

Groin · December 29, 2022, 3:14pm

That was answering @espoojaram, but it seems the link didn’t work. Sorry for the annoyance.

Conrad_Melville · December 29, 2022, 3:14pm

Okay, thank you.

Groin · December 29, 2022, 3:24pm

I made that suggestion; I think it’s linked directly in the first thread of the compendium.

espoojaram · December 29, 2022, 4:38pm

…obvious to who? On a 9x9 game, I can barely assess my opponent’s strength as a 15 kyu, and by “assess” I mean mostly “if they get an easy life&death problem wrong, I know they’re inexperienced”.

By the way, there’s another little flaw in the idea, though it’s not a big deal.

Technically, we can’t actually promise this (“fair games”), because of what I said here:

which means even in an ideal system where all beginners can self-identify and play with other self-identified beginners and there’s no sandbaggers, many beginners will be significantly weaker than the average of beginners, and they’ll have to go through the same exact rigamarole of losing games until they find the appropriate rating.

If the sandbagger problem is contained, it will usually be a faster and more painless process than starting at 6 kyu, but they’ll still get kind of stepped on.

Though this is a great counter-point:

But on the other hand, since beginners typically don’t have familiarity with etiquette, I expect the on-steppers won’t be very polite in chat while on-stepping.

But I’m probably supposed to be the one who doesn’t care about beginners, so I guess I won’t care?

Again, I’ll support the plan if you people ever stop walking in circles and actually start taking action (if anyone’s wondering, this is a jab at the fact that there are proposals for this kind of system at least as far back as 2018, and every time someone was like “but Glicko-2 requires 1500” and every time someone else correctly was like “actually no” and everybody was like “oh, let’s do it then” and here we are in December 2022).

I’m genuinely not sure what the results will (would?) be.

I did perhaps think of a way to somewhat “measure” the success of the system though: if we compile statistics about abandoned young accounts, with a self-identifying beginner system, I’d expect to see most of the alt-sandbaggers moving up in the ranks, while the accounts of discouraged beginners should either stagnate or go down in rating before being abandoned.

With our current system, discouraged beginners should be the ones that go down and then abandon, while I imagine alt-sandbaggers will win at least a few games before abandoning.

So, of course it’s not a rigorous thing, but if we measure those quantities, maybe that will give us an idea of which system is better.
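A rough sketch of how that tally might look, assuming we could export, for each abandoned young account, its rating after each ranked game. The `classify` helper, its `tolerance` threshold and the sample histories are all invented for illustration; none of this reflects real OGS data or tooling.

```python
from collections import Counter

def classify(ratings, tolerance=25.0):
    """Label an abandoned account by its overall rating trend.

    `ratings` is the account's rating after each ranked game, oldest first.
    `tolerance` (hypothetical) is how many points count as "no real change".
    """
    if len(ratings) < 2:
        return "too_few_games"
    change = ratings[-1] - ratings[0]
    if change > tolerance:
        return "rose"        # pattern expected of alt-sandbaggers
    if change < -tolerance:
        return "fell"        # pattern expected of discouraged beginners
    return "stagnated"

def tally(accounts):
    """Count trend labels over a collection of rating histories."""
    return Counter(classify(history) for history in accounts)

# Invented sample histories, purely to show the shape of the output.
sample = [
    [1500, 1420, 1350, 1290],   # kept losing, then left
    [1500, 1610, 1700],         # kept winning, then left
    [1500, 1510, 1495],         # hovered around the start
    [1500],                     # left after a single game
]
counts = tally(sample)
```

Comparing these counts before and after a change to the entry system would only be a hint, not proof, for the reasons given above.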

Maharani · December 29, 2022, 9:24pm

I know you’ve already linked one example, but is there really a significant amount of 26+ kyu players? The lowest rank I ever “achieved” was 25k (in today’s rating conversion - back then, it was 40k). I knew nothing about the game except that black and white take turns, and that chains with only one liberty can be captured (although I didn’t know that this was true even if the liberty was an eye on the inside). Until the example you showed, I’d never even seen a rating significantly lower than 640!

espoojaram · December 29, 2022, 9:51pm

According to the usual histogram, it’s about 2% of the current player base on OGS, though most of them are above 32k – and if the hypothesis that the current rating system discourages weak beginners is correct, there may be many more that never even get to their rank and don’t appear in the statistics. Other than this, I have no idea

shinuito · December 29, 2022, 11:40pm

It used to be a thing and it led to lots of problems, I believe something like ratings drift and inflation and other things, but I’d have to dig up some old threads.

espoojaram · December 30, 2022, 9:06am

I (thought I) knew this and before now my opinion was “but Glicko-2 is exceptionally adept at solving those problems (that I was aware of)”, but the specific point brought up in the linked post is extremely good.

@Groin: that should be added to your compendium (unless your intention is to deliberately only gather topics that support your position and ignore ones that don’t, I guess). Or actually, maybe just add this one we’re writing.

espoojaram · December 30, 2022, 10:10am

Oh, actually, I believe Groin’s proposal (if tuned appropriately) mostly takes care of the points anoek brings up there:

You can set up a “beginner room” where new accounts (if they self-identify as beginners) can play their first, say, 10 to 20 games. Then they can get introduced in the bigger pool of established players with a provisional tentative beginner rating based on their performance in the room (like, between 20k and “30k”, for example).

Because of this:

It really seems that we don’t have an unmanageable influx of newcomers, much less beginner newcomers.
While on one side this is a shortcoming of Groin’s proposal, because it means beginners can’t always play other beginners, this actually means we probably have the resources to monitor most, if not all, games going on in that room, and recognize the sandbaggers when they show up.
(We could implement this idea incrementally to make sure we have the resources: first have the beginners play only a few games there, and increase periodically to see how many we can handle)

Or in other words, if we have volunteer (hopefully trusted) teachers, and mods when they can, hang out in those rooms too, they can recognize the fake beginners, and hopefully even explain what happened to the real beginners that have been sandbagged.
If there’s only one beginner, they play their first games with a volunteer teacher.

I believe this also solves the other little problem I was worried about, that beginners playing unsupervised might be rude to each other.

Hopefully, this means the sandbaggers will be discouraged by how much of a pain it is to pretend to be a beginner while being monitored for so many games, and with time they might be a negligible phenomenon.

Wanting to nitpick, there is still a potential problem: TPK sandbagging might become a more pronounced phenomenon, and it will be difficult even for teachers and mods to recognize them. But hopefully in that case the experience won’t be particularly traumatic (for the reasons brought up by @meowkorkor in the bullet list above).

Now, I guess I should sugar-coat this as much as I can, but I can’t resist: one objection I wouldn’t like to hear is “but this could muddy up the statistics”. Yes, there are still ways that could happen (although this system imho would contain it a lot more than other proposals). I think it’s been established that the forum community believes it’s worth it to improve the beginner experience.

Other than that, I invite everyone to point out all the flaws they see in this proposal, so we can determine if it’s worth anything or if it can be improved (I imagine it can)

(I had another idea first, but the above is much better.)

Here’s an idea that might solve, or at least contain, those problems, and it would help a bit with the sandbagger problem too: you can alternate matchmaking based on the self-declared rating and the “normal” rating.

This revolves around the proposal of assigning two ratings, a “median” one and a self-declared one (with limitations*), to newcomers, and declare them “ranked” when these two converge.

Alternating the matchmaking means the true beginners will only meet stronger players half of the time, and sandbaggers will have a comparable (well, about half the) amount of work to do relative to the current system.

*limitations being, it might be bad for the rating (self)assigned to always be the same, as it might lead to skewing and drifting of the rating pool mostly due to dishonest newcomers, so it might be better if the "self-declared rank" was really more of a range, and then the newcomer is assigned a uniformly random rating within that range, or something like that. Also, don't allow people to set their provisional rank as an extreme of the distribution unless they can certify it.
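To make the two ideas above concrete, here is a minimal sketch of drawing the self-declared rating from a range and alternating which rating the matchmaker uses. Every name and number in it (`MIN_UNCERTIFIED`, `MAX_UNCERTIFIED`, `draw_declared_rating`, `matchmaking_rating`, the 1500 "normal" rating) is an assumption for illustration, not anything OGS actually does.

```python
import random

# Hypothetical bounds: ratings outside this band would need certification.
MIN_UNCERTIFIED = 700.0
MAX_UNCERTIFIED = 2100.0

def draw_declared_rating(low, high, rng, certified=False):
    """Pick a uniformly random rating inside the self-declared range."""
    if low > high:
        raise ValueError("low must not exceed high")
    if not certified:
        # Clamp the range so uncertified newcomers cannot start at an extreme.
        low = max(low, MIN_UNCERTIFIED)
        high = min(high, MAX_UNCERTIFIED)
        if low > high:
            raise ValueError("declared range needs certification")
    return rng.uniform(low, high)

def matchmaking_rating(games_played, declared, normal):
    """Alternate: even-numbered games use the declared rating, odd ones the normal one."""
    return declared if games_played % 2 == 0 else normal

rng = random.Random(0)  # seeded only to make the example repeatable
declared = draw_declared_rating(600.0, 1000.0, rng)
schedule = [matchmaking_rating(n, declared, 1500.0) for n in range(4)]
```

Here a newcomer declaring "600 to 1000" gets a rating somewhere in the clamped 700 to 1000 band, and `schedule` switches between that value and the normal one from game to game.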

Maharani · December 30, 2022, 10:14am

The fact that @anoek seems to be in a coma…

teapoweredrobot · December 30, 2022, 1:08pm

It is Christmas/Twixmas and anoek doesn’t get much time off as far as I can tell!

I’ve probably not followed all this as much as I should and I’d be keen to make ogs as welcoming to beginners as possible, while also recognising that it also needs to be attractive to stronger and especially dan players. It seems to be a difficult problem by its nature. I just wanted to understand how this is envisaged to work

my immediate thought is that this sounds like a significant reprogramming effort but I’m definitely not an expert in such things.

meili_yinhua · December 30, 2022, 1:52pm

Okay, I’ve decided to poke my head in here, and it seems inevitable we start talking about the glicko-2 entry point.

Frankly I think if the matchmaking would actually make use of the “humble rank” during its term of use, that would continue to be a nice compromise…

But if you really want to get into the nitty-gritty of choosing entry points, one would need to be able to ascertain which information to take in, and how reliable that information is.

For example, when you allow players to self-identify, does that self-identification, on average, match the resulting rank that the player is expected to have? And how large is the standard deviation of the distribution of “true ratings” of players that have identified such? Should players be allowed to not self-identify and instead start with the classic initialization parameters that are tuned for the assumption that we don’t have better information? If we use tsumego, how reliable are the results actually? After all there’s a joke about all go problems being mislabeled as tradition

One could also imagine a “beginner room”, but if there’s already a protocol in place to prioritize provisional players playing against each other, I don’t imagine it would be very helpful: it would likely prolong the wait for their games, and that extra wait could make a beginner feel the game isn’t worth trying at all. Getting smacked around is one thing, not even being able to play is another.

Now, on these here forums we’ve talked the insertion point up and down the wall multiple times. There has been a clear feeling even from the first implementation of Glicko-2 that retention of new players seems a little bit difficult when you need to lose several games to hit your rating. One could also imagine the potential of trying to examine how many players we lose to this, since a lot of the evidence we’ve given is either anecdotal or theoretical. I’ve had many more players that I’ve been hesitant to send straight into OGS than ones I’ve seen give up after the early losses, and while there’s a bit of survivorship bias present in this also very small sample, it’s also better evidence than “losing will make them have a bad experience”

meowkorkor · December 30, 2022, 2:46pm

Two more questions:

A new player (who is actually TPK but OGS treats as 6 kyu) plays a game against a “real” 10 kyu. If the 10 kyu wins (as expected), does the system give their rank a large boost for beating a “6 kyu”?
Is it actually necessary for new accounts to be assigned a starting rank? Could they simply be treated as having no rank?

jlt · December 30, 2022, 3:00pm

I think the new player is treated by OGS as a 12k but the rank is considered as uncertain, something like 12k ± 5. So if the new player loses against a solid 16k, then the rank of the 16k will increase, but not as much as if the 16k had won against a solid 12k (rank 12k ± 1).
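A small sketch of why the uncertainty matters, using the published single-game update of the original Glicko system rather than Glicko-2 or OGS's own code (which may well differ). The ratings and deviations are illustrative numbers only; the point is the `g` factor, which shrinks the effect of a result when the opponent's rating deviation is large.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Weight that shrinks as the opponent's rating deviation grows."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected_score(r, r_opp, rd_opp):
    """Expected score against an opponent, discounted by their uncertainty."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))

def rating_after_game(r, rd, r_opp, rd_opp, score):
    """New rating after one game (score: 1 = win, 0 = loss)."""
    e = expected_score(r, r_opp, rd_opp)
    inv_d2 = Q**2 * g(rd_opp) ** 2 * e * (1 - e)
    return r + Q / (1 / rd**2 + inv_d2) * g(rd_opp) * (score - e)

# Illustrative numbers only: an established player beats a nominally stronger opponent.
established, established_rd = 1300.0, 60.0
gain_vs_uncertain = rating_after_game(established, established_rd, 1500.0, 350.0, 1) - established
gain_vs_solid = rating_after_game(established, established_rd, 1500.0, 60.0, 1) - established
```

With these inputs the win over the uncertain opponent is worth noticeably less than the same win over the solid one.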

You can play unranked games if you want. I don’t understand what you mean by “treated as having no rank”.

espoojaram · December 30, 2022, 3:15pm

It seems @meili_yinhua already anticipated this, but for the sake of the discussion I’d like to explicitly point out that as far as I can see the system I described (the supervised beginner’s room) can very easily be implemented to have a fallback to the current system in case there aren’t enough suitable players (both beginners and teachers) around. (In fact, I should edit in some more details about how this allows it to be implemented gradually).

Obvious response, probably anticipated: this is speculation (the idea that having to wait a lot is worse than losing, I mean), and it’s a hypothesis that I’d guess to be wrong.

As far as the testing of such a hypothesis goes, I proposed this in my second-to-last novel:

Click to see quote

This was for a naive self-declared rank system, but I think it can be adapted for a beginner room system. Do you think this idea has potential?

Replies to some of meili_yinhua's points

Does that matter? In all the past threads I read, you were usually the one pointing out that it’s not a requirement for Glicko-2 to fix the entry point.

Edit: I realized “fixing the entry point” and “fixing the entry point at 1500” are two different things, so maybe you’re saying the first one is important for Glicko-2. I’ll wait for your reply.

I would assume it’s fine to give them the same starting deviation as all other players (350 – in the case of the beginner room, this is even the deviation I would assign to players coming out of the room), if that’s what you meant.
This is especially useful in case they’re a sandbagger or a troll.
I don’t understand why it should be a problem, since we even allow players to have multiple accounts, each of them starting with a new provisional rank and 350 deviation. If it was a problem to not use all the info we have about each starting players, then the current system should be suffering from this?

My opinion is yes. The current system (as long as we fix the graphical glitches causing confusion) is very good for most players who are not at the extreme ends of the spectrum of strength. Once we take care of the self-identified beginners (and perhaps the self-identified dans?), everyone else should start somewhere in the middle of the distribution. 6 kyu is the mode (the rank most common among established players on the site).

Funnily enough, the user that inadvertently gave me this idea thought it would be even easier to code than self-declared ranks. But they’re not experts either.

And I'm no expert either, but here's my impression if anyone cares.

On one side, we have an automatching system that follows a “priority” system, so matching “marked” (beginners/teachers) players shouldn’t be too difficult. Making sure that the rating of the new players is reset when they exit the beginner room might be more annoying. Creating a way for assigned teachers and mods to specifically monitor the games in the virtual room is probably the most difficult part.

Answers to meowkorkor's questions

The evidence I’ve seen so far convinces me that the starting rating is most likely actually 1500 (equivalent to 6 kyu), but it’s hidden in the graphical display by the “humble rank” system.
The rest is correct: since new players have a very uncertain rank, winning or losing against them doesn’t affect established players much, assuming Glicko-2 is implemented correctly in that aspect.

As far as the matchmaking goes,

the system sort of already treats them as unranked. According to what the dev (anoek) said in the thread @shinuito linked above, the system tries to match “?” players with other “?” players, and only tries something different if that fails.

As far as the rating system itself goes,

we currently use a system where we rank players based on how they do against other players. I believe the mathematics of the rating calculation wouldn’t make sense if the player didn’t have a starting rank – you would have to resort to more complicated probabilistic models to guess the rating.
The system OGS uses is adapted from a system called “Glicko-2”, which in my opinion is much more elegant than that, because it can treat a starting player with the same exact algorithm as any other player, but that requires some starting guess. The current guess is the most common rank among current OGS players.

As I said before, it doesn’t make sense (from a user experience standpoint) to show that starting rank, anywhere, because even if it’s the best guess, it’s more likely to be wrong than it is to be right.

paisley · December 30, 2022, 3:17pm

Regarding “treating new accounts as having no rank,” I see no theoretical or technical reason to stop an implementation of Glicko-2 from cranking the initial rating deviation (RD) up very high. For technical reasons, you still have to assign an initial rating (which can of course be displayed as “?”), but with a deviation of “± 40 stones” or similar, that’s essentially equivalent to not assigning a rank at all.

If my memory serves, I believe the reasoning for the initial rating deviation of 350 Elo points is that this is the value chosen by default in the Glicko-2 documentation. There was discussion that changing this might lead to unexpected consequences. I don’t believe there are actually negative consequences, but I’m of course not 100%, and I certainly understand that hesitation, because Glicko-2 is quite complex.

Edit: one disadvantage of cranking the RD like this would be that, after their first game, a new player would have a truly wild rank. If they win their first game, it might wind up being something like “20 dan”, and something like “50k” if they lose. These could of course still be displayed as “?”, but if that info is used for matchmaking, it would certainly be a thing. This would probably work better with something like the “beginner room” proposal, with custom-tailored matchmaking for “?” players.
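To get a feel for the size of that first-game swing, here is a sketch using the single-game update of the original Glicko system (a simplification: not Glicko-2, and not OGS's implementation). The function name, the 1500-rated opponent and the deviations tried are assumptions chosen for illustration.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def first_game_jump(start_rd, r=1500.0, r_opp=1500.0, rd_opp=60.0):
    """Rating gained by a newcomer who wins one game (Glicko-1 style)."""
    g_opp = 1 / math.sqrt(1 + 3 * Q**2 * rd_opp**2 / math.pi**2)
    e = 1 / (1 + 10 ** (-g_opp * (r - r_opp) / 400))
    inv_d2 = Q**2 * g_opp**2 * e * (1 - e)
    return Q / (1 / start_rd**2 + inv_d2) * g_opp * (1 - e)

# Try the default deviation and two "cranked up" ones.
jumps = {rd: first_game_jump(rd) for rd in (350.0, 700.0, 2000.0)}
```

In this simplified model the jump grows with the starting deviation, and it also depends heavily on the opponent's rating, so actual magnitudes on a real server would depend on its implementation and matchmaking.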

espoojaram · December 30, 2022, 3:30pm

Keep in mind that a standard deviation of 350, for a Gaussian distribution, really represents an “uncertainty” of ± 1050 (3 sigma), reasonably speaking. Meaning the 99% confidence interval for a provisional player is between about 500 and 2500, or 25 kyu to… 9d+?
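For anyone who wants the exact figures, a quick check with Python's standard library, assuming the rating really is modelled as a Gaussian with mean 1500 and standard deviation 350 (the variable names are just for this example):

```python
from statistics import NormalDist

provisional = NormalDist(mu=1500, sigma=350)

# Range covered by +/- 3 standard deviations, and the probability mass inside it.
three_sigma = (provisional.mean - 3 * provisional.stdev,
               provisional.mean + 3 * provisional.stdev)
coverage = provisional.cdf(three_sigma[1]) - provisional.cdf(three_sigma[0])

# Central interval that holds exactly 99% of the mass.
central_99 = (provisional.inv_cdf(0.005), provisional.inv_cdf(0.995))
```

The 3-sigma range comes out as 450 to 2550 and covers roughly 99.7% of the distribution, while a strict 99% interval is somewhat narrower, about 600 to 2400.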

You realized yourself the drawback of a cranked up deviation

Other thing to keep in mind: the current system works quite well for most players. Could the deviation be fine-tuned for better performance in the initial ranking? Maybe, but certainly not a priority.

paisley · December 30, 2022, 3:32pm

I agree. I was simply pointing out that it could be done, to answer @meowkorkor’s question.

meowkorkor · December 30, 2022, 3:34pm

I see. There is one feature (or perhaps bug) that contradicts this. Before I completed my first game (I play correspondence, so I started many games before completing one), when I created a custom challenge, I could not restrict rank to below 15 kyu or above 3 dan.

One chess website handles games between an unrated player and an established player in an interesting way. They influence the eventual rating of the unrated player, but have no impact on the rating of the established player. The rationale is:

Any result against an unrated player provides no useful information on the strength of the rated player.
To encourage established players to play unrated players without fear of losing rating points.
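A toy sketch of that one-sided rule, using a plain Elo-style step as a stand-in because the actual formula that chess site uses isn't given here. `apply_result` and the step size `k` are hypothetical.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score for `rating` against `opponent_rating`."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def apply_result(unrated, established, unrated_score, k=40.0):
    """Return (new_unrated, new_established) after one game.

    Only the unrated player's rating moves; the established player's rating
    is returned unchanged. `k` is a hypothetical step size.
    """
    e = expected_score(unrated, established)
    return unrated + k * (unrated_score - e), established

# The newcomer (provisionally 1500) loses to an established 1800 player.
new_unrated, new_established = apply_result(1500.0, 1800.0, unrated_score=0)
```

The established player's number is untouched whatever the result, which is the whole point of the rule.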

More importantly, when it is wrong, it misleads players in a way that spoils their experience.

A beginner who knows the rules but does not know how Go ranks work may mistakenly think 6 kyu is a reasonable rank for them and they would have a fair chance against a “real” 6 kyu.
It is not fun for a “real” 6 kyu to start a game, thinking they face a fellow 6 kyu, only for their opponent to play out basic ladders.