Thank you for your response, and for your work on this issue.
I am not saying that the original winrate prediction function was correct. My main point is that there is a specific systematic bias in your analysis of the data that does not average out. Statistics is tricky, and small biases can sometimes have a large effect; so can incorrect data in only a few percent of the samples.
“When we average over all games (not players), we lose that noise and we can just fit a smooth winrate function.” Yes, but you also acquire a systematic shift in the winrates. For instance, if the per-player SD is 50 Gor (so the rating difference has an SD of about 70 Gor), roughly 15% of games between truly even players will show a Gor difference of more than 100, so it will look as if, between players 100 Gor apart, the weaker player wins more often than is actually the case. The underlying problem is that in statistics <f(x)> is not f(<x>), where <> denotes averaging.
One could estimate this effect on the winrates with a simple numerical simulation. Alternatively, you could take the data from players with an established rating (in some period), compute each player's average rating, and assign that average to all of their games, so that the Gor difference of a game is computed from the averaged ratings rather than from the ratings at that specific point in time.
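The simulation could look like the sketch below. All the constants are illustrative assumptions, not values from the actual GoR system: a logistic winrate curve with a scale of 200 Gor, true rating gaps drawn with SD 150 Gor, and per-player rating noise of SD 50 Gor. The point is only to show that games *recorded* as 100 Gor apart produce a winrate below what the true curve predicts at 100 Gor:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_winrate(d, scale=200.0):
    # hypothetical logistic winrate curve (NOT the actual GoR formula)
    return 1.0 / (1.0 + 10.0 ** (-d / scale))

n = 1_000_000
true_diff = rng.normal(0.0, 150.0, n)            # assumed spread of true rating gaps
noise = rng.normal(0.0, 50.0 * np.sqrt(2), n)    # per-player noise SD 50 -> diff SD 50*sqrt(2)
obs_diff = true_diff + noise                     # the gap we would see in the database

# simulate game results from the TRUE gaps
wins = rng.random(n) < true_winrate(true_diff)

# winrate among games whose OBSERVED gap is near 100 Gor
mask = np.abs(obs_diff - 100.0) < 10.0
measured = wins[mask].mean()
print(measured, true_winrate(100.0))  # measured comes out below the curve's prediction
```

With these assumptions the measured winrate at an observed gap of 100 sits a few percentage points below `true_winrate(100)`, i.e. the weaker-looking player wins more often than the fitted curve would suggest, which is exactly the smearing effect described above.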
Your argument that most games are played by correctly rated players is sensible, but one would still need to estimate the fraction of incorrectly rated games (and not the fraction of incorrectly rated players, as you correctly pointed out). (In my probably atypical experience this would still be around 20%.) Furthermore, it is not clear how big an effect even a small percentage of incorrectly rated games has (my intuition is that for big Gor differences it could be noticeable).
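A back-of-the-envelope version of that intuition, using the same hypothetical logistic curve as an assumption (the 5% contamination rate and 300 Gor gap are also just illustrative): if a small share of the games recorded as "300 Gor apart" are really between even players, the apparent winrate at that gap drops visibly:

```python
def true_winrate(d, scale=200.0):
    # hypothetical logistic winrate curve (NOT the actual GoR formula)
    return 1.0 / (1.0 + 10.0 ** (-d / scale))

# mixture: 95% of games genuinely 300 Gor apart,
# 5% actually between even players but misrecorded as 300 apart
p_clean = true_winrate(300.0)
p_mixed = 0.95 * p_clean + 0.05 * true_winrate(0.0)
print(round(p_clean, 3), round(p_mixed, 3))  # → 0.969 0.946
```

So under these assumptions, just 5% of misrated games pulls the apparent winrate at a 300 Gor gap down by more than two percentage points, which is large compared to the statistical error on a big sample.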
Yes, I think you should compare the results of all games with the results of “filtered games”. If there is not much of a difference, the effect is small. If the difference is noticeable, then one needs to think further about how to actually filter the data in an optimal way.
As for filtering, I would suggest, for example, including only games where both players satisfy:
a) They played at least 20 rated games in the last year
b) Their rating in that period didn’t change by more than something like 35 Gor
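Criteria a) and b) could be implemented along these lines. The data format is hypothetical (per-player rating records already restricted to the last year), and I am reading b) as "max minus min rating in the period is at most 35 Gor":

```python
from collections import defaultdict

def established_players(history, min_games=20, max_drift=35.0):
    """history: iterable of (player_id, rating) records, assumed to be
    pre-restricted to the period of interest (e.g. the last year).
    Returns ids of players with >= min_games rated games whose rating
    stayed within a band of max_drift Gor."""
    by_player = defaultdict(list)
    for pid, rating in history:
        by_player[pid].append(rating)
    return {
        pid for pid, ratings in by_player.items()
        if len(ratings) >= min_games
        and max(ratings) - min(ratings) <= max_drift
    }

def filtered_games(games, history):
    # keep only games where BOTH players pass criteria a) and b)
    ok = established_players(history)
    return [(p1, p2) for p1, p2 in games if p1 in ok and p2 in ok]
```

One could then fit the winrate function on `filtered_games(...)` and on all games, and compare the two curves as suggested above.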