Some speculation on my part:
It seems that OGS’ rating system has greater rating volatility than other systems. It doesn’t seem exceptional for a player’s rating to fluctuate over a range of several ranks (perhaps this is a side effect of the system being tweaked so that newcomers’ ratings converge quickly from OGS’ universal starting rating towards an established rating in the right ballpark?).

Either way, when the system calculates automatic handicaps from relatively volatile ratings of both players, this can easily lead to a sense of the automatic handicap being wrong in many games (either too much or too little handicap, thus failing to even out winning chances).
I feel it’s not handicap that should take the blame here, but rating volatility.

But the combination of the OGS rating system with the handicap system actually seems to be pretty solid, as it keeps the winrate of Black at around 43% in games with less than 5 stones of handicap. The statistics (or some statistics) are discussed here. According to that post, 43% is also the winrate produced by the handicap/rating system the EGF adopts.
And in this very topic (EDIT: no longer after the fork), reply 42, I proposed two conjectures as to why that might be the case:

43% is a good winrate for handicap games around 1d level (side note: I think it should be closer to 50% in the DDK - TKP range).
But this 43% average can be reached even when relatively many games have too much or too little handicap. The one doesn’t exclude the other.

I agree, but at the same time I can’t personally think of any better measure of how solid a handicap/rating system is than the average winrate. Although, funnily enough, there’s someone who might help: @Allerleirauh, who just replied here (EDIT: no longer after the fork xD), seems to have a hobby of compiling statistics about OGS games. Perhaps he could help?

Neither can I.
My speculation was just that the perception of the quality of automatic handicaps can suffer when the rating system has a rather large rating volatility overall.
I think it should be possible to measure this volatility and compare different systems by this metric.

I completely agree on this, especially the idea that the volatility causes a perception of the handicap being unfair, regardless of whether that perception is true or not.

Although another question that remains unanswered is whether the volatility of OGS ranks is more of a bug or a feature, as it might also be an accurate representation of a player’s general skill floating depending on their mood and other variables.

For the hundredth time, I will propose that displaying a rank calculated from a strongly smoothed out version of the rating might be a preferable system, even while keeping the current rating system for the matchmaking.
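A minimal sketch of what I mean (my own assumption of an exponential moving average for the smoothing; OGS doesn’t actually do this, and `alpha` is a made-up parameter):

```python
# Sketch: deriving a smoothed display rank from a volatile rating history.
# Assumption (NOT the actual OGS method): an exponential moving average
# with a small alpha damps short-term swings while tracking long-term drift.

def smoothed_ratings(ratings, alpha=0.1):
    """Return an EMA-smoothed copy of a rating history."""
    smoothed = []
    ema = ratings[0]
    for r in ratings:
        ema = alpha * r + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed

# A noisy history oscillating around 1500 by +/- 60 rating points:
history = [1500, 1560, 1440, 1530, 1470, 1550, 1450, 1520]
print(smoothed_ratings(history))  # stays within a few points of 1500
```

The raw rating would still drive matchmaking; only the displayed rank would come from the smoothed series, so it wouldn’t jump a rank on every win streak.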

In fact, you know what, OGS could even try using such a rank to calculate the handicap, in a “control group” experiment to test the hypothesis we talked about (that the average winrate is an artifact of excessively high and low handicaps canceling out). Wait, I guess that’s basically what you just proposed too.

I don’t see how the average winrate can measure the quality of the rating and handicap system.

The more chaotic the rating is, the closer to 50% I would expect the winrate to be.

In the extreme, if the pairing system was completely random, choosing white and black randomly and setting a random number of handicap stones between 0 and 9, then I would expect the average winrate to be very close to 50%.

It’s not quite that simple, although in the extreme case that’s probably true.

If a system keeps the average around 50%, at the very least it means it’s unbiased. I believe that’s the definition of unbiased: a mathematical function that produces the “correct” expected value.

By that definition, both the EGF and the OGS handicap/rating systems are in fact somewhat biased, because an ideal handicap system would produce a winrate of 50% for Black instead of 43%.

When everything is completely random, then you can’t measure anything.
But when you use handicaps determined by the rating gap between players, you would be able to see when something is completely off: especially in higher handicap games, the winrates for white and black would be skewed far from 50%.

Handicapping for a 50% expected winrate is not very hard to do: just use komi in handicap games too, so with a 1 rank difference this means 2 stones with black giving komi, instead of just getting black without komi. Also, maybe use Chinese/free placement of the handicap stones.
But this would break with the tradition of handicap.
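For what it’s worth, the arithmetic behind the “2 stones with komi” figure can be checked with the usual rules of thumb (my assumptions, nothing official): one rank ≈ one stone ≈ two komi, and h handicap stones without komi compensate roughly h - 0.5 ranks, since the first “stone” is just moving first. A toy sketch:

```python
# Rule-of-thumb check (assumptions: 1 rank ~ 1 stone ~ 2 komi).
# h handicap stones compensate about h - 0.5 ranks relative to an
# even game with komi; if black also gives komi, subtract ~0.5 rank.

def ranks_compensated(stones, black_gives_komi):
    comp = stones - 0.5
    if black_gives_komi:
        comp -= 0.5
    return comp

print(ranks_compensated(2, black_gives_komi=True))   # 2 stones, black gives komi: 1.0 rank
print(ranks_compensated(1, black_gives_komi=False))  # traditional handicap: only 0.5 rank
```

So 2 stones with black giving komi compensates a full rank, while traditional black-without-komi only covers about half a rank, which is consistent with black’s sub-50% winrates in handicap games.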

For rating systems in chess (so no handicap) it makes sense to use binomial deviance, which goes by a lot of other names. I’m not entirely sure how you would use it for handicap.
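For reference, binomial deviance is the same quantity that’s elsewhere called log loss or cross-entropy. A minimal sketch of computing it over a set of game predictions (even games; how to fold handicap in is exactly the open question):

```python
import math

# Binomial deviance, better known as log loss / cross-entropy: the
# penalty is -log(p) when the predicted event happens, -log(1 - p)
# when it doesn't. A lower mean loss means better predictions.

def log_loss(predictions, outcomes):
    """Mean log loss; outcomes are 1 if black won, 0 if white won."""
    total = 0.0
    for p, won in zip(predictions, outcomes):
        total += -math.log(p) if won else -math.log(1 - p)
    return total / len(predictions)

# A confident correct prediction is cheap, a confident wrong one is costly:
print(log_loss([0.9], [1]))  # small loss
print(log_loss([0.9], [0]))  # large loss
```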

Your experience may well be true, but your experience of being a dan-level player (~top 5%? of OGS players) can be very different from that of a player at OP’s level, or @Atorrante 's level. To propose a mere speculation, perhaps lower-level players choose to play you to challenge themselves and thus want no handicap?

The point is, I think the OP is right that handicaps can help beginners learn the game and I think we should encourage him to request handicaps in his games. Even if he gets canceled, he can just create another challenge.

As @ArsenLapin1 pointed out, the average winrate doesn’t show if there’s lots of both too low handicap and too big handicap, because they cancel out each other.

Maybe it would be interesting to see how well the rating system would do if it was betting on itself. It does make finer distinctions within the range of one rank, so it could say “I expect black’s chance to win is 45%”, and if black wins it would get 0.45 points. If white wins it would get 0.55 points.

I imagine you meant “If white wins it would lose 0.55 points”.

Yeah, I think that might be a good test!

(to be honest my initial reaction was “wouldn’t it just get 0.43 points in the end?”, but I think that would only be the result if the system always predicted a 50% winrate in the specific case of handicap games)

Now we just, uh, need to figure out the mathematics from the Glicko-2 system and actually write code to collect the data.

Oh, ok. Then I guess you need to either divide by the number of games (like an arithmetic mean), use some other kind of average, or otherwise adjust the expectation based on the number of games.

Intuitively the sum of those points should be greater than half the number of games if it gets most predictions right, less if it gets most wrong. If it’s close, uh… I have no idea. I’m starting to think we need a better test or maybe just someone who knows more about statistics than me!

EDIT: maybe I got something. If you group those predictions in buckets of comparable probability (e.g. all the ones where 0.55 <= p < 0.56), you can see in what percentage of games in that bucket the rating system guesses right, and if it’s good, you’d expect the percentage to be equal to the prediction.
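Something like this toy sketch, maybe (made-up data just to illustrate the bucketing; the 1%-wide buckets are an arbitrary choice):

```python
from collections import defaultdict

# Calibration check: group games by predicted black winrate into
# 1%-wide buckets and compare the mean prediction in each bucket with
# the observed winrate. A well-calibrated system should roughly match.

def calibration_table(predictions, outcomes, width=0.01):
    """Map bucket index -> (mean predicted p, observed black winrate, n games)."""
    buckets = defaultdict(list)
    for p, won in zip(predictions, outcomes):
        buckets[int(p / width)].append((p, won))
    table = {}
    for b, games in sorted(buckets.items()):
        ps = [p for p, _ in games]
        ws = [w for _, w in games]
        table[b] = (sum(ps) / len(ps), sum(ws) / len(ws), len(games))
    return table

# Toy data: five games predicted near 0.55, black won three of them.
preds = [0.551, 0.554, 0.558, 0.552, 0.556]
wins = [1, 1, 0, 1, 0]
for b, (mean_p, rate, n) in calibration_table(preds, wins).items():
    print(f"bucket {b}: predicted {mean_p:.3f}, observed {rate:.2f}, n={n}")
```

With real data you’d want thousands of games per bucket, of course, or the observed rates are mostly noise.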

When using winrate as a measure of how solid a handicap/rating system is,
one should use the winrate for each handicap individually, rather than
lumping them all together, to avoid what ArsenLapin1 pointed out.

A trivial betting-agent would say 50:50 to every game and get 0.5 points for every game. A smarter agent would bet 50:50 for even games and 43:57 for handicap games and get 0.51 points per bet in a handicap game. OGS should get even more points, because it knows in 9.9k vs 10.1k game, the 9.9k has a slightly better chance of winning.
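To spell out that arithmetic (a quick sketch using the scoring rule from upthread, where an agent earns the probability it assigned to the actual winner):

```python
# Expected points per game for an agent betting bet_black on black,
# when black's true winning chance is true_black: it earns bet_black
# if black wins and 1 - bet_black if white wins.

def expected_score(bet_black, true_black):
    return true_black * bet_black + (1 - true_black) * (1 - bet_black)

print(round(expected_score(0.50, 0.43), 4))  # trivial 50:50 agent: 0.5
print(round(expected_score(0.43, 0.43), 4))  # betting the true handicap odds: 0.5098
```

So the smarter agent gains about 0.01 points per handicap game over the trivial one, and an agent that also uses the finer within-rank rating differences should do better still.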

Luckily, the vast majority of them are around 43%, plus or minus 5%, judging from the graphs.

(Here are the graphs)

Oh, you know what. Looking at these graphs, it seems like the very first graph might be very close to what we were talking about in the last few messages. Though I’m not exactly sure how to interpret it.

When I said that handicap winrates in the TKP range should be closer to 50%, I meant theoretically.
Both OGS and EGF data seem to show otherwise. In reality handicap winrates for black even seem to drop below 40% in the TKP range.

My interpretation of this phenomenon is that perhaps the rating range of the game of go should extend further down than ~25-30k, maybe to 40-45k?

But it’s difficult to find a public data source on how much handicap a raw novice needs against (say) an established 20k. I suppose it’s difficult to establish what a “raw novice” even means. A random 5 yo raw novice is not the same as a 20 yo raw novice who is also a university student in mathematics. Also, raw novices may not play much on 19x19, so perhaps there won’t be much data on this at all.