In the context of “should this game be unranked to prevent messing up the rank system”, I think a sensible definition of “yes, that’s a strange setting” is a game with a very lopsided expected winrate, say >95% for either side. I always thought the old rule that ranked games need to be within 9 ranks was backwards:
A 10k playing a 10k even is a sensible game: the outcome is near 50-50, so it gives useful information to the ranking system, and the rating won’t get perturbed much by non-game-skill things like bad internet causing timeouts.
A 10k playing a 1k with 9h is similarly sensible, as it is near 50-50.
A 15k playing a 1k with 14h is just about as sensible, as it is near 50-50.
A 10k playing another 10k with 9h is silly, as the expected winrate is close to 100-0, so the ranking system gets little useful info.
A 10k playing a 1k with no handicap is also silly, as it is close to 100-0.
It’s not the rank difference of the players that’s the thing to test, it’s the effective rank difference after accounting for handicap.
So a universal solution is: if the expected winrate is >95%, the game should automatically be unranked, whatever the ranks and handicap. In the real world, timeouts, beer and other disturbances are possible, so the actual winrate would never be 100%, even against a 25k.
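A minimal sketch of that rule, under assumptions that are mine rather than anything OGS is confirmed to do: kyu ranks only, one handicap stone compensates exactly one rank, one rank is worth about 100 rating points, and win probability follows an Elo-style logistic curve with a 400-point scale. All names and constants are hypothetical.

```python
# Hypothetical sketch: decide whether a game should be ranked from the
# expected winrate after accounting for handicap. The constants below are
# illustrative assumptions, not OGS's real rating parameters.

POINTS_PER_RANK = 100.0   # assumption: one rank ~ 100 rating points
ELO_SCALE = 400.0         # assumption: Elo-style logistic curve
MAX_WINRATE = 0.95        # threshold proposed in the thread


def effective_rank_gap(black_kyu: int, white_kyu: int, handicap: int) -> int:
    """Ranks by which White is still stronger after Black's handicap.

    Kyu numbers shrink as strength grows, so a 10k vs a 1k is a 9-rank gap.
    Assumes one handicap stone compensates exactly one rank.
    """
    return (black_kyu - white_kyu) - handicap


def expected_winrate(rank_gap: float) -> float:
    """Win probability of the side that is `rank_gap` ranks stronger."""
    return 1.0 / (1.0 + 10.0 ** (-rank_gap * POINTS_PER_RANK / ELO_SCALE))


def should_be_ranked(black_kyu: int, white_kyu: int, handicap: int) -> bool:
    gap = effective_rank_gap(black_kyu, white_kyu, handicap)
    p_white = expected_winrate(gap)
    # Symmetric test: neither side may be favoured by more than MAX_WINRATE.
    return max(p_white, 1.0 - p_white) <= MAX_WINRATE


if __name__ == "__main__":
    # The five example games from the first post: (black kyu, white kyu, stones).
    games = [(10, 10, 0), (10, 1, 9), (15, 1, 14), (10, 10, 9), (10, 1, 0)]
    for black, white, stones in games:
        gap = effective_rank_gap(black, white, stones)
        print(f"{black}k vs {white}k, H{stones}: gap {gap:+d}, "
              f"ranked={should_be_ranked(black, white, stones)}")
```

With these assumed constants the 95% cutoff falls between an effective gap of 5 and 6 ranks; a different points-per-rank value would move it.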
Hmm - if the handicaps are “right” (and actually, the win/loss record in the OP, taken at face value, seems to say they are), then it IS the ranking system that is not handling it properly.
It’d be great to have @dexonsmith’s assessment of this.
Agreed in principle. I’m just not sure how robust the handicap system is when used in such extreme settings. I’d be surprised if this is truly 50%.
I suppose it doesn’t have to be 50% to have a fun and competitive game. I’d consider anywhere between ~25% and ~75% as good enough for that (that is, a rating gap corrected for handicap of less than ~200 rating points).
OGS ranks can be pretty volatile so the rank gap between an OGS 1k and OGS 15k may already have error bars of a couple of ranks (~100-200 rating points). I think it will be challenging to accurately measure the relation between rank gap adjusted by handicap and winrate, also because the rank gap will change if the winrate becomes somewhat skewed.
If there is a worry that ranked games with high handicap might mess up the rating system, you could perhaps add a weight factor to reduce the size of rating adjustments depending on the handicap. The EGF rating system does this rather abruptly (weight is 100% up to 9 handicap, 0% with more than 9 handicap), but you could use a smoother weight function.
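As a rough sketch of what a smoother weight could look like: the step function below follows the EGF behaviour as described in this post, while the logistic roll-off, its midpoint and its steepness are made-up illustrative choices, not a proposal anyone has tuned.

```python
import math

# Hypothetical weight functions for scaling rating adjustments by handicap.
# `egf_style_weight` mirrors the step described in the post above;
# `smooth_weight` is one illustrative alternative, and its midpoint and
# steepness are made-up parameters.


def egf_style_weight(handicap: int) -> float:
    """Full weight up to 9 stones, no weight beyond that."""
    return 1.0 if handicap <= 9 else 0.0


def smooth_weight(handicap: int, midpoint: float = 9.5,
                  steepness: float = 1.0) -> float:
    """Logistic roll-off: near 1 for small handicaps, 0.5 at `midpoint`."""
    return 1.0 / (1.0 + math.exp(steepness * (handicap - midpoint)))


def weighted_adjustment(raw_adjustment: float, handicap: int) -> float:
    """Scale a rating change computed elsewhere by the handicap weight."""
    return raw_adjustment * smooth_weight(handicap)


if __name__ == "__main__":
    for stones in (0, 6, 9, 10, 12, 15):
        print(f"H{stones}: step={egf_style_weight(stones):.0f}, "
              f"smooth={smooth_weight(stones):.3f}")
```

A lower `steepness` spreads the transition over more stones; the right shape would have to come from actual handicap game data.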
I don’t think games should be unrated simply for having a high expected win rate for one side or the other: any decent rating algorithm will already account for this.
The reason games with more than 9 handicap stones are not rated is that, around that point, handicap stones are estimated to start diverging too much from the effect they’re meant to have.
Indeed it should if the game result matches the expected result. But when a 5k plays an H9 game against another 5k, the player taking the stones is expected to win almost 100% of the time, because the effective rating gap is maybe 600+ rating points. In theory that expectation would be correct, but in practice there are mishaps outside gameplay (like a bad internet connection or an accidental timeout in byoyomi) that make the real effective rating gap smaller than the theoretical one.
Then again, I don’t think it is a big problem, because those mishaps also impact even games between very unevenly matched players. Worst case is that mishaps in very unevenly matched games have twice the impact of mishaps in evenly matched games. As long as players play many more evenly matched games than unevenly matched games, it shouldn’t be a problem, I think.
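One way to read the “twice the impact” estimate, assuming an Elo-style update where the rating change is proportional to (actual score - expected score). That update rule and the K-factor are illustrative assumptions, not how OGS necessarily computes ratings.

```python
# Hypothetical illustration: size of the rating change caused by a mishap
# loss (e.g. a timeout), assuming an Elo-style update K * (score - expected).

K = 32.0  # arbitrary illustrative K-factor, not an OGS parameter


def rating_change(score: float, expected: float, k: float = K) -> float:
    """Elo-style adjustment: proportional to how surprising the result was."""
    return k * (score - expected)


def mishap_swing(expected: float) -> float:
    """Rating change for a player whose expected score is `expected`
    but who loses the game to a mishap instead (score 0)."""
    return rating_change(0.0, expected)


if __name__ == "__main__":
    for expected in (0.5, 0.75, 0.95, 0.99):
        ratio = mishap_swing(expected) / mishap_swing(0.5)
        print(f"expected score {expected:.2f}: "
              f"change {mishap_swing(expected):+.2f} ({ratio:.2f}x an even game)")
```

Under this reading the ratio approaches 2 as the expected score approaches 1, which is where the worst-case factor of two comes from.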
If the rating system is good, such statistical noise will not be an issue. I am very much in the camp of picking a good rating system and letting it do its work.
For the mathy nerds out there, this is why, all else equal, in somewhat noisy environments like online play, it’s usually preferable for rating systems to use likelihood functions with singly exponential tails and NOT quadratically exponential tails.
You get singly exponential tails implicitly in most Elo-based systems from the use of sigmoids or the logistic distribution in the foundations of the math, and sometimes explicitly in systems that use explicit iterative Bayesian optimization with these kinds of likelihood functions. So most standard systems do get this for free.
You get quadratically exponential tails if the mathematical foundations of the rating system model a player’s performance in a given game as a normal distribution, for example “a player’s rating is the mean of a normal distribution, and a game is modeled as both players drawing a number from their distribution, with the higher number winning”, or if you use the normal CDF as the curve that the system considers to represent the chance of winning a game as a function of the rating difference. The normal distribution falls off as exp(-x^2), i.e. the exponential of a quadratic.
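To make the difference in tails concrete, here is a small sketch comparing the negative log-probability of an upset under a logistic curve and under a normal CDF. The 400-point logistic scale and the matching sigma (via the usual 1.702 logistic-to-normal approximation) are my illustrative assumptions.

```python
import math

# Compare how fast the "upset" probability shrinks under two win-probability
# curves. Scales are illustrative assumptions: a 400-point Elo-style logistic
# and a normal CDF whose sigma is chosen to roughly match it near zero.

ELO_SCALE = 400.0
SIGMA = 1.702 * ELO_SCALE / math.log(10)  # common logistic-to-normal match


def upset_prob_logistic(gap: float) -> float:
    """Chance the weaker player wins when `gap` rating points behind."""
    return 1.0 / (1.0 + 10.0 ** (gap / ELO_SCALE))


def upset_prob_normal(gap: float) -> float:
    """Same, with the normal CDF as the win-probability curve."""
    return 0.5 * math.erfc(gap / (SIGMA * math.sqrt(2.0)))


def surprise(prob: float) -> float:
    """Negative log-likelihood of an event with probability `prob`."""
    return -math.log(prob)


if __name__ == "__main__":
    for gap in (0, 200, 400, 800, 1200, 1600):
        print(f"gap {gap:5d}: "
              f"logistic {surprise(upset_prob_logistic(gap)):6.2f}  "
              f"normal {surprise(upset_prob_normal(gap)):6.2f}")
```

For large gaps the logistic column should grow by a roughly constant amount per extra 400 points (a linear exponent), while the normal column grows faster and faster (a quadratic exponent).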
When you have quadratically exponential tails, then a rating system will tend to behave in a “mean-like” way with respect to game results in the tails - i.e. the final rating is like the average rating that each individual result implies. So a sufficiently big outlier mishap like internet disconnection on a game you should have been extremely likely to win can have a continued major effect in the estimated rating the same way that a big outlier can skew the average value of a sample. (At the root, this is because quadratic tails ~= squared errors => solve for player ratings to minimize sum of squares of errors => the mean is the value that minimizes this).
By contrast, if you have singly exponential tails, then a rating system will tend to behave in a “median-like” way with respect to game results in the tails, i.e. the final rating is kind of like the median rating that each individual result implies, at least as far as outliers are concerned (most good systems do better yet, behaving like an average for non-outlier results, but like a median with respect to outlier results). So an isolated outlier will at most just be one additional game on that side of the median, rather than having outsized impact. (At the root, this is because singly exponential tails ~= absolute errors => solve for player ratings to minimize sum of absolute errors => the median is the value that minimizes this).
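A toy experiment along these lines, with made-up numbers: estimate one player’s rating by maximum likelihood from 20 even-looking games, then add a single mishap loss against a much weaker opponent and see how far the estimate moves under each kind of likelihood. The scales, the grid search and the game data are all assumptions for illustration, not any real system’s algorithm.

```python
import math

# Hypothetical experiment: 20 games against 1500-rated opponents (10 wins,
# 10 losses) plus one "mishap" loss against a far weaker opponent. We compare
# how far the maximum-likelihood rating moves under a logistic likelihood
# (singly exponential tails) and a normal-CDF likelihood (quadratic tails).

ELO_SCALE = 400.0
SIGMA = 1.702 * ELO_SCALE / math.log(10)  # normal curve roughly matching the logistic


def win_prob_logistic(diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-diff / ELO_SCALE))


def win_prob_normal(diff: float) -> float:
    return 0.5 * math.erfc(-diff / (SIGMA * math.sqrt(2.0)))


def log_likelihood(rating, games, win_prob):
    total = 0.0
    for opponent, won in games:
        # Both curves are symmetric, so a loss is a win seen from the other side.
        diff = rating - opponent if won else opponent - rating
        total += math.log(win_prob(diff))
    return total


def best_rating(games, win_prob):
    """Grid-search maximum-likelihood rating over integers 1000..2000."""
    return max(range(1000, 2001),
               key=lambda r: log_likelihood(r, games, win_prob))


BASE_GAMES = [(1500, True)] * 10 + [(1500, False)] * 10


def shift(outlier_opponent, win_prob):
    """Rating points lost because of one extra loss to `outlier_opponent`."""
    with_mishap = BASE_GAMES + [(outlier_opponent, False)]
    return best_rating(BASE_GAMES, win_prob) - best_rating(with_mishap, win_prob)


if __name__ == "__main__":
    for opponent in (1000, 500, 0):
        print(f"mishap loss vs {opponent}: "
              f"logistic shift {shift(opponent, win_prob_logistic)}, "
              f"normal shift {shift(opponent, win_prob_normal)}")
```

If the argument above holds, the logistic shift should level off as the mishap gets more extreme, while the normal shift keeps growing with the size of the outlier.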
Lol I didn’t even realize changing the distribution was an option. Are we currently using the quadratically exponential model? Is there a popular rating system that uses singly exponential tails?
I guess cutting the tails off (unranking large mismatches) is one way to deal with this problem.