Rank system and 9x9 handicaps problem

square_fuseki · May 30, 2024, 8:15pm

More than half of the games against the SDK bot were won.
Yet the rank keeps going down and has now reached 23k.
Has Glicko gone crazy?

Update: it looks like those are 9x9 games with handicap stones, so an almost 100% win rate is expected from the user.

But I think something should be done to protect the rank of users who choose strange game settings.

square_fuseki · May 30, 2024, 8:41pm

related?

GreenAsJade · May 30, 2024, 11:23pm

How is “playing 9x9 with handicap” a strange setting?

This is a genuine question: I literally do not know what is supposed to be normal or not with 9x9 and handicap.

Taken at face value, the data above doesn’t say “the rank system is broken”; it says “9x9 handicaps aren’t working”.

Uberdude · May 30, 2024, 11:43pm

In the context of “should this game be unranked to prevent messing up the rank system”, I think a sensible definition of “yes, that’s a strange setting” is a game with a very high or very low expected winrate, say >95% for one side. I always thought the old rule of ranked games needing to be within 9 ranks was backwards:

A 10k playing a 10k even is a sensible game where the outcome is near 50-50, so it gives useful information to the ranking system and won’t get perturbed by non-game-skill things like bad internet causing timeouts.
A 10k playing a 1k with 9h is similarly sensible as near 50-50.
A 15k playing a 1k with 14h is just about as sensible as near 50-50.
A 10k playing another 10k with 9h is silly as the expected winrate is 100-0, so the ranking system gets little useful info.
A 10k playing a 1k with no handicap is also silly as 100-0.

It’s not the rank difference of the players that’s the thing to test, it’s the effective rank difference after accounting for handicap.
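
A minimal sketch of that test, for illustration only. It is not how OGS computes anything: it assumes one handicap stone offsets exactly one rank, that one rank is worth 100 rating points, and that winrate follows the usual Elo logistic curve. The function names and the 0.95 threshold parameter are hypothetical.

```python
def black_expected_winrate(black_kyu, white_kyu, handicap,
                           points_per_rank=100.0, scale=400.0):
    """Elo-style expected winrate for Black after accounting for handicap.

    Assumptions (not taken from OGS): one handicap stone offsets one rank,
    one rank is worth `points_per_rank` rating points, and the win curve
    is the Elo logistic with a 400-point scale.
    """
    # A lower kyu number means a stronger player, so white_kyu - black_kyu
    # is how many ranks Black is ahead (negative if Black is weaker).
    effective_gap_ranks = (white_kyu - black_kyu) + handicap
    rating_gap = effective_gap_ranks * points_per_rank
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def is_strange_setting(black_kyu, white_kyu, handicap, threshold=0.95):
    """True if either side's expected winrate exceeds the threshold."""
    p = black_expected_winrate(black_kyu, white_kyu, handicap)
    return max(p, 1.0 - p) > threshold


# The five cases from the list above: (Black kyu, White kyu, handicap).
cases = [(10, 10, 0), (10, 1, 9), (15, 1, 14), (10, 10, 9), (10, 1, 0)]
for black, white, stones in cases:
    p = black_expected_winrate(black, white, stones)
    print(f"{black}k vs {white}k, {stones}h: Black wins {p:.1%}, "
          f"strange={is_strange_setting(black, white, stones)}")
```

Under these assumptions the first three cases come out at 50% and are kept, while the last two are flagged, even though the raw rank difference alone would suggest the opposite for the 10k vs 10k game.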

square_fuseki · May 30, 2024, 11:53pm

So the universal solution is: if the expected winrate is >95%, any game should be automatically unranked, because in the real world timeouts, beer and other disturbances are possible, and the actual winrate would never be 100%, even vs a 25k.

Groin · May 30, 2024, 11:58pm

I played a game with this bot, which took 2 stones automatically on 13x13. Nothing special to report, got a fair game, won by 2.5.

I am amazed by the recent changes on handicap calculation for smaller boards, it’s great.

GreenAsJade · May 31, 2024, 1:09am

Hmm - if the handicaps are “right” (and actually, the win/loss record in the OP at face value seems to say they are), then it IS the ranking system that is not handling it properly.

It’d be great to have @dexonsmith’s assessment of this.

qnpnpmqppnp · May 31, 2024, 6:44am

Agreed in principle. I’m just not sure how robust the handicap system is when used in such extreme settings. I’d be surprised if this is truly 50%.

gennan · May 31, 2024, 9:25am

I suppose it doesn’t have to be 50% to have a fun and competitive game. I’d consider anywhere between ~25% and ~75% as good enough for that (rating gap corrected for handicap is smaller than ~200 rating points).

OGS ranks can be pretty volatile so the rank gap between an OGS 1k and OGS 15k may already have error bars of a couple of ranks (~100-200 rating points). I think it will be challenging to accurately measure the relation between rank gap adjusted by handicap and winrate, also because the rank gap will change if the winrate becomes somewhat skewed.

If there is a worry that ranked games with high handicap might mess up the rating system, you could perhaps add a weight factor to reduce the size of rating adjustments depending on the handicap. The EGF rating system does this rather abruptly (weight is 100% up to 9 handicap, 0% with more than 9 handicap), but you could use a smoother weight function.
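
To make the idea concrete, here is a rough sketch of an abrupt weight next to a smoother one. The abrupt version follows the description above (full weight up to 9 stones, none beyond). The smooth version, its linear shape and its `full_until` / `zero_from` parameters are purely hypothetical choices for illustration, not a proposal for specific values.

```python
def abrupt_weight(handicap):
    """Cutoff as described above: 100% up to 9 stones, 0% beyond."""
    return 1.0 if handicap <= 9 else 0.0


def smooth_weight(handicap, full_until=4, zero_from=14):
    """Hypothetical smoother alternative: linear fade between two values.

    Games with `full_until` stones or fewer count fully, games with
    `zero_from` stones or more do not count, and the weight falls
    linearly in between.
    """
    if handicap <= full_until:
        return 1.0
    if handicap >= zero_from:
        return 0.0
    return (zero_from - handicap) / (zero_from - full_until)


def weighted_adjustment(raw_adjustment, handicap, weight=smooth_weight):
    """Scale whatever adjustment the rating system computed by the weight."""
    return raw_adjustment * weight(handicap)


for stones in (0, 4, 9, 10, 14):
    print(stones, abrupt_weight(stones), smooth_weight(stones))
```

With these example parameters a 9-stone game would count half and a 10-stone game 40%, instead of jumping from 100% to 0% between them.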

Samraku · May 31, 2024, 9:35am

I don’t think games should be unrated simply for having a high expected win rate for one side or the other: any decent rating algorithm will already account for this

The reason games with >9 handicap stones are not rated is that it’s estimated that around that point handicap stones start diverging too much from the effect they’re meant to have.

gennan · May 31, 2024, 9:53am

Indeed it should if the game result matches the expected result. But when a 5k plays a H9 game against another 5k, they are expected to win almost 100% of the time because the effective rating gap is maybe 600+ rating points. In theory that expectation would be correct, but in practice there are mishaps outside gameplay (like a bad internet connection, accidental timeout in byoyomi) reducing the theoretical effective rating gap.

Then again, I don’t think it is a big problem, because those mishaps also impact even games between very unevenly matched players. Worst case is that mishaps in very unevenly matched games have twice the impact of mishaps in evenly matched games. As long as players play many more evenly matched games than unevenly matched games, it shouldn’t be a problem I think.
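
The “twice the impact” figure can be checked with a back-of-the-envelope calculation. This sketch uses a plain Elo update (change = K × (score − expected score)) as a stand-in, which is an assumption: OGS uses a Glicko-based system whose updates also depend on rating deviations. The K value of 32 is arbitrary and cancels out in the ratio.

```python
def expected_score(rating_gap, scale=400.0):
    """Elo logistic expected score for the player who is `rating_gap` ahead."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def elo_delta(rating_gap, score, k=32.0):
    """Plain Elo rating change (an assumption, not the OGS update rule)."""
    return k * (score - expected_score(rating_gap))


# A mishap (e.g. a timeout) turns the game into a loss: score = 0.
even_mishap = elo_delta(0.0, 0.0)        # evenly matched game
lopsided_mishap = elo_delta(600.0, 0.0)  # effective gap of 600 points

impact_ratio = lopsided_mishap / even_mishap
print(f"even: {even_mishap:.2f}, lopsided: {lopsided_mishap:.2f}, "
      f"ratio: {impact_ratio:.2f}")
```

The loss in the even game costs half of K, while the loss in the lopsided game costs almost all of K, so the ratio approaches 2 from below as the gap grows and never exceeds it.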

Samraku · May 31, 2024, 10:04am

If the rating system is good, such statistical noise will not be an issue. I am very much in the camp of picking a good rating system, and letting it do its work

gennan · May 31, 2024, 10:17am

I think the current rating system is already a good rating system. I think the only downside is that it is a bit complicated.

Samraku · May 31, 2024, 10:21am

I think it will be much improved with the next ratings update, but yes

BHydden · May 31, 2024, 11:41am

Imagine the things anoek could do if people didn’t keep insisting he rebuild the rating system every 3 years

gennan · May 31, 2024, 11:45am

I’m not aware that there is currently much pressure to update it again (at least not the core rating system algorithm or the player ratings)?

benjito · May 31, 2024, 12:56pm

I think @dexonsmith is working on a v6 of his own accord

hexahedron · May 31, 2024, 1:27pm

For the mathy nerds out there, this is why, all else equal, in somewhat noisy environments like online play, it’s usually preferable for rating systems to use likelihood functions with singly exponential tails rather than quadratically exponential tails.

You get singly exponential tails implicitly in most Elo-based systems from the use of sigmoids or the logistic distribution in the foundations of the math, and sometimes explicitly in systems that use explicit iterative bayesian optimization with these kinds of likelihood functions. So most standard systems do get this for free.

You get quadratically exponential tails if the mathematical foundations of the rating system model a player’s performance in a given game as a normal distribution, such as “player’s rating is the mean of the normal distribution, and a game is modeled as both players selecting a number from their distribution and the higher number wins”, or if you use the normal CDF as the curve that the system considers to represent the chance of winning a game as a function of the rating difference. The normal distribution falls off as exp(-x^2), i.e. exponential of a quadratic.

When you have quadratically exponential tails, then a rating system will tend to behave in a “mean-like” way with respect to game results in the tails - i.e. the final rating is like the average rating that each individual result implies. So a sufficiently big outlier mishap like internet disconnection on a game you should have been extremely likely to win can have a continued major effect in the estimated rating the same way that a big outlier can skew the average value of a sample. (At the root, this is because quadratic tails ~= squared errors => solve for player ratings to minimize sum of squares of errors => the mean is the value that minimizes this).

By contrast, if you have singly exponential tails, then a rating system will tend to behave in a “median-like” way with respect to game results in the tails, i.e. the final rating is kind of like the median rating that each individual result implies, at least as far as outliers are concerned (most good systems do better yet, behaving like an average for non-outlier results, but like a median with respect to outlier results). So an isolated outlier will at most just be one additional game on that side of the median, rather than having outsized impact. (At the root, this is because singly exponential tails ~= absolute errors => solve for player ratings to minimize sum of absolute errors => the median is the value that minimizes this).
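
Here is a toy illustration of the difference, not a model of any real rating system. It finds the maximum-likelihood rating for one player by brute-force grid search, once with a logistic win curve (Elo-style, 400-point scale) and once with a normal-CDF win curve. The standard deviation of 200·√2, the opponent ratings and the game records are all made-up assumptions chosen so that both curves agree on the baseline.

```python
import math


def logistic_win_prob(diff, scale=400.0):
    """Win probability with singly exponential tails (Elo logistic)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def normal_win_prob(diff, sd=200.0 * math.sqrt(2.0)):
    """Win probability with quadratically exponential tails (normal CDF)."""
    return 0.5 * math.erfc(-diff / (sd * math.sqrt(2.0)))


def log_likelihood(rating, games, win_prob):
    """Sum of log-probabilities of the observed results at this rating."""
    total = 0.0
    for opponent, won in games:
        # Both curves are symmetric, so P(loss) = win_prob(opponent - rating).
        diff = rating - opponent if won else opponent - rating
        total += math.log(win_prob(diff))
    return total


def best_rating(games, win_prob, lo=1000, hi=2000):
    """Grid search for the rating that maximizes the likelihood."""
    return max(range(lo, hi + 1),
               key=lambda r: log_likelihood(r, games, win_prob))


# Made-up record: 10 wins and 10 losses against 1500-rated opponents.
baseline = [(1500, True)] * 10 + [(1500, False)] * 10
# One mishap: a loss (say, a disconnect) against a 500-rated opponent.
with_mishap = baseline + [(500, False)]

logistic_drop = 1500 - best_rating(with_mishap, logistic_win_prob)
normal_drop = 1500 - best_rating(with_mishap, normal_win_prob)
print(f"logistic drop: {logistic_drop}, normal drop: {normal_drop}")
```

Both curves put the baseline at 1500. After the mishap, the logistic estimate moves about as much as one ordinary extra loss would move it, while the normal-CDF estimate drops noticeably more, and the gap widens further if the mishap opponent is made even weaker.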

And that’s our nerdy math aside for the day.

benjito · May 31, 2024, 1:53pm

Lol I didn’t even realize changing the distribution was an option. Are we currently using the quadratically exponential model? Is there a popular rating system that uses singly exponential tails?

I guess cutting the tails off (unranking large mismatches) is one way to deal with this problem

square_fuseki · May 31, 2024, 2:03pm

but [?] users still should be able to get rank from any game