New EGF algorithm implemented

The EGF rating system commission’s assignment is mostly over (it’s not a permanent commission) and I don’t know of any new proposals to overhaul the EGF rating system again.

But you might want to contact the Belgian Go Federation. They have their own rating system and the new chairman Michael Silcher might want to propose further changes to the EGF rating system (he was also involved in the commission and he was a very fervent proponent of lowering the rating floor of the EGF rating system).


It depends a bit on how one counts players, but this gives you an idea:


Thinking a bit more about this update, I realised I have a question.
Namely, if I understand correctly, the parameter a was updated based on empirical data?

I find this problematic, if it is not done carefully.
The empirical a would be systematically incorrect (in the way it was corrected now) if the game results contain wrongly rated players. From my experience, around 20% of players in the range 4k-4d have their ratings wrong by more than 150 Gor. I believe this would be a big problem when using the empirical data. Or did the data include only matches between established players, e.g. players with at least 5 tournaments in the last year or something similar?

Even in the case of using players with an established rating (edited; before, I incorrectly wrote rank), there is some standard deviation in players’ ratings. We can probably estimate it from the parameter a itself. I believe even this could be enough to skew the empirical estimate of the parameter a.

The effect of the changed parameter a is that the results of under/overrated players are less surprising, so they stay at an incorrect rating longer (i.e. more tournaments are needed for players to reach their correct ratings). But if the new value is indeed more correct, it should better rate players who play mostly weaker players.

The winrate statistics were collected using rating differences, not rank differences. So we only assumed that predicted winrates for given rating differences should be consistent with observed winrates for those rating differences.

Sorry, above I incorrectly wrote established rank instead of established rating. The algorithm only deals with ratings, so my whole post is only about rating.

To illustrate my point with a made-up example:
let’s say we look at the results between players 200 Gor apart, and we observe that the weaker players win in 20% of the cases. If the sample contains 80% correctly rated players and 20% players who are actually of equal strength, the true win rate of the weaker players is actually 12.5%.
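The arithmetic behind that example can be checked directly (a minimal sketch; the 80/20 split and the 20% observed winrate are the numbers from the example above):

```python
# Observed winrate is a mixture of two populations:
#   80% of games between correctly rated players (unknown true winrate p),
#   20% of games between players who are actually equal (true winrate 0.5).
observed = 0.20      # weaker-rated player wins 20% of all sampled games
equal_share = 0.20   # share of games between secretly equal players
equal_winrate = 0.50

# Solve: observed = (1 - equal_share) * p + equal_share * equal_winrate
p = (observed - equal_share * equal_winrate) / (1 - equal_share)
print(p)  # 0.125, i.e. genuinely weaker players win only 12.5% of the time
```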

Yes, if you assume that 20% are wrongly rated, then the expected winrates will be wrong as well (until the rating updates catch up with the true strength difference).

But how did you determine that 20% was wrongly rated (by more than 150 GoR in the range 4k-4d) and how did that happen?

This is just based on my experience, but it might not be exactly correct, and I don’t know how relevant it is for the sample of all EGD games. Europe is big, and there are regional differences in the go scene.
Let me also stress that the effects of underrated and overrated players do not cancel (in my made-up example you can assume the 20% are 400 Gor apart, and the numbers do not change much).

Let me try to list the reasons for people to be wrongly rated:

  • well-rated players have ratings that fluctuate with some standard deviation (SD). My estimate of the SD at 1 kyu is 40 Gor, so about 30% of these players have their ratings off by more than 40 Gor
  • players from a different part of Europe (they can have the correct rating for their area, which can be off by 50-100 Gor or something like that)
  • players not from Europe entering the tournament under the wrong rank/rating
  • improving players (especially if they play mostly online)
  • players who play once every few years; they can still carry an old rating that is several hundred Gor too high
  • etc.
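The “30%” figure in the first bullet follows from a normal model of the fluctuations; a quick check, assuming the rating error is Gaussian:

```python
import math

def frac_beyond(threshold, sd):
    """Fraction of a zero-mean normal distribution lying more than
    `threshold` away from the mean (two-sided tail probability)."""
    return math.erfc(threshold / (sd * math.sqrt(2)))

# With SD = 40 Gor, the fraction of players off by more than 40 Gor:
print(frac_beyond(40, 40))  # ~0.317, i.e. roughly 30%
```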

In the end, I don’t know how much this effect skews the empirical data of the winrates of the EGD games. It might be negligible or not.

Even ratings of established players on a plateau will fluctuate. I wouldn’t say their ratings are “wrong”. It’s just noise. The size of those fluctuations is largely determined by the con factor (often called K in regular Elo rating systems), which scales the step size of rating updates. A windowed standard deviation of individual player rating histories is fairly tightly coupled to this con factor.

When we average over all games (not players), we lose that noise and we can just fit a smooth winrate function.

Players who have played very few tournament games will (temporarily) have less accurate ratings, but those ratings will not affect the average of all games very much, because their relative contribution to all games is small. Most games come from players who play a lot and thus have an “established” rating (more or less automatically).

Ratings from one large geographical region can deviate from another large region (this is the case before and after the system update). But again, most players will play games within their region, so inter-regional games will only be a small proportion of all games and those won’t affect the average over all games very much.

The original winrate prediction function was more like a guess (because there was little data available in 1996). But now we have a lot of data, so we used that to improve the predictions.

If you disagree with this procedure, then how do you propose to measure the quality of predictions? How would you prove that the original predictions were “better”?

Or is it that you think we should filter the game results of the EGD to exclude untrustworthy games before we start averaging? I think it’s not so easy to define criteria for deciding which game results are trustworthy and which aren’t. If you’re not careful with those filtering criteria, you could even introduce accidental biases. And for what reason?
Still, I think such a filtering effort would not change the end result (the winrate prediction function) very much.

Thank you for your response, and for your work on this issue.
I am not saying that the original winrate prediction function was correct. My main point was that there is a specific systematic bias in your analysis of the data that does not average out. Statistics is tricky; sometimes small biases have a big effect, and incorrect data in a few percent of the samples can have a big impact.

“When we average over all games (not players), we lose that noise and we can just fit a smooth winrate function.” Yes, but you also acquire a systematic shift in the winrates. For instance, if the SD is 50 Gor, then about 15% of games between truly even players will show a Gor difference of more than 100, so it will seem that between players 100 Gor apart, the weaker player wins more often than is actually the case. The main underlying problem is that in statistics <f(x)> is not f(<x>), where <> denotes averaging.

One could estimate this effect on the winrates using a simple numerical simulation. But you could also take the data from players with established ratings (in some period), compute the average rating of each player, and then assign that average rating to all of his games. The Gor difference of a game would then be calculated from the averaged ratings, not from the ratings at that specific point in time.
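The smearing effect described above can be estimated with a small Monte Carlo: give every pair a true strength gap, add Gaussian rating noise with SD 50 Gor per player, and compare the winrate observed at a nominal ~100 Gor difference with the winrate the model predicts for a true 100 Gor difference. The logistic winrate function and the scale parameter A are illustrative assumptions here, not the EGD’s actual fit:

```python
import math, random

random.seed(1)
SD = 50.0    # assumed per-player rating noise in Gor (the value from the example above)
A = 110.0    # illustrative Bradley-Terry scale parameter, not the EGF's fitted value

def win_prob(true_diff):
    """Probability that the nominally stronger player wins, given the TRUE gap."""
    return 1.0 / (1.0 + math.exp(-true_diff / A))

wins = games = 0
for _ in range(600_000):
    true_diff = random.uniform(-300, 300)        # true strength gap of a random pair
    noise = random.gauss(0, SD * math.sqrt(2))   # noise on the gap (two noisy ratings)
    observed_diff = true_diff + noise
    if 90 <= observed_diff <= 110:               # bin: games at a nominal ~100 Gor gap
        games += 1
        if random.random() < win_prob(true_diff):
            wins += 1

# The winrate observed at a nominal 100 Gor gap comes out systematically lower
# than the model's winrate for a true 100 Gor gap, because the bin also catches
# pairs whose true gap is smaller (or even negative).
print(wins / games, win_prob(100))
```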

Your argument that most games are played by correctly rated players is sensible, but one would still need to estimate the number of incorrectly rated games (and not the number of incorrectly rated players, as you correctly pointed out). (In my probably atypical experience this would still be around 20%.) Furthermore, it is not clear how big an effect even a small percentage of incorrectly rated games makes (my intuition is that for big Gor differences it could be noticeable).

Yes, I think you should compare the results of all games with the results of “filtered games”. If there is not much of a difference, the effect is small. If the difference is noticeable, then one needs to think further about how to actually filter the data in an optimal way.
As for filtering, I would suggest, for example, including games between players where for both players it holds that:
a) They played at least 20 rated games in the last year
b) Their rating in that period didn’t change by more than something like 35 Gor
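As a sketch of what such a filter might look like in code (the data structure, field names and thresholds are all assumptions here, simply mirroring criteria a) and b) above):

```python
from dataclasses import dataclass

@dataclass
class PlayerYear:
    """Hypothetical per-player summary over the last year of EGD games."""
    rated_games: int
    min_rating: float
    max_rating: float

def is_established(p: PlayerYear, min_games=20, max_drift=35.0):
    """Criteria a) and b): enough rated games, and a stable rating."""
    return p.rated_games >= min_games and (p.max_rating - p.min_rating) <= max_drift

def keep_game(player1: PlayerYear, player2: PlayerYear):
    """Keep a game for the winrate statistics only if both players pass."""
    return is_established(player1) and is_established(player2)

print(keep_game(PlayerYear(25, 1990, 2015), PlayerYear(30, 2100, 2130)))  # True
print(keep_game(PlayerYear(10, 1990, 2015), PlayerYear(30, 2100, 2130)))  # False
```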

That is true, but when f(x) is close to linear in a small range of x, it is a decent approximation.

In this case, the function is actually a function of 2 variables r1 and r2 (the ratings of both players). And in typical McMahon tournaments, r1 and r2 are not that far apart (~100) compared to the whole range of ratings (>3000). As long as the f(r1, r2) surface can be assumed to be sufficiently smooth over smaller distances ~100 between r1 and r2 (no sharp local bumps, i.e. its Laplacian is small everywhere) <f(r1, r2)> should be close to f(< r1 >, < r2 >). The statistical analysis basically samples the local gradients of that function (grad f(r1, r2)) and from there you can make a fit for f(r1, r2).
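The step from <f(r1, r2)> to f(<r1>, <r2>) can be made explicit with a standard second-order Taylor expansion (nothing here is specific to the EGF fit):

```latex
\langle f(r_1, r_2) \rangle \approx f(\langle r_1 \rangle, \langle r_2 \rangle)
  + \frac{1}{2}\left( \sigma_{r_1}^2 \frac{\partial^2 f}{\partial r_1^2}
  + 2\,\mathrm{Cov}(r_1, r_2)\,\frac{\partial^2 f}{\partial r_1 \partial r_2}
  + \sigma_{r_2}^2 \frac{\partial^2 f}{\partial r_2^2} \right)
```

so the bias consists of the curvature terms weighted by the rating variances, which is why a small Laplacian over distances of ~100 keeps the approximation good.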

We did use different regression methods to verify. Logistic regression basically gave the same results as simpler regression methods.

Here you can see a plot of the new winrate prediction function, showing it is quite smooth over smaller distances (~100).

Here you can see a plot of its Laplacian.


Those criteria would probably filter too much for lower ratings, because lower rated players don’t play very much and their ratings tend to fluctuate more, even if they are “established” players (because the con factor is much larger for lower ratings).
For higher ratings, con is smaller and players tend to play more tournament games, so the filter would probably filter too little.

So I think the filter criteria would need to be rating-dependent. By trial and error it could be determined which criteria would filter out about 20% of the less trustworthy games in each part of the rating range, where “trustworthiness” is some function of the running average rating, the number of games played and the running standard deviation of the rating in recent history.

This would also be an iterative process, because the winrate prediction function affects all rating changes, which affects what you are trying to measure. To determine the relation between ratings and winrates, you have to start out with a guess for that relation and work towards some improving measure of fitness/consistency (there is some feedback/chicken-and-egg/self-fulfilling prophecy effect). And it is also affected by other parameters of the system (such as con, bonus and external data such as declared ranks and rating reset policies).

Thank you for your clarifications.
I implemented the rating algorithm and ran it on randomly generated game results. In this way, I found that the standard deviation at 2000 Gor is about 44. Then I used this to obtain the empirical expected winrate by convolution, and I saw that the effect is there but it is indeed negligible.
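For reproducibility, here is a sketch of that experiment: simulate an evenly matched player under a simplified GoR-style update rule and measure the SD of the rating history. The con and a values below are stand-in assumptions (the real EGF values are rating-dependent), so the resulting SD differs somewhat from the 44 quoted above:

```python
import math, random, statistics

random.seed(7)
CON = 21.0   # assumed step size (con) around 2000 Gor
A = 110.0    # assumed winrate scale parameter

def expected_score(diff):
    """Predicted winrate for a player rated `diff` Gor above the opponent."""
    return 1.0 / (1.0 + math.exp(-diff / A))

# The player's true strength equals the opponents' rating (2000 Gor), so the true
# win probability is 0.5 no matter where the noisy rating currently sits.
rating, history = 2000.0, []
for _ in range(200_000):
    result = 1.0 if random.random() < 0.5 else 0.0
    rating += CON * (result - expected_score(rating - 2000.0))
    history.append(rating)

print(statistics.stdev(history))  # stationary SD of the rating random walk
```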

The problem of systematically wrongly rated players mostly matters in the tails of the empirical winrate functions. But the tails are fixed by the Bradley-Terry formula, so now I see that they are not that important for fitting the parameter a.
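On the tails specifically: in its common logistic parametrisation (with an illustrative scale parameter A, an assumption here), the Bradley-Terry form forces the upset probability to decay geometrically, so each extra fixed Gor gap multiplies it by roughly the same factor:

```python
import math

A = 110.0  # illustrative scale parameter; the EGF's fitted value may differ

def upset_prob(diff):
    """Bradley-Terry probability that the weaker-rated player wins at gap `diff`."""
    return 1.0 / (1.0 + math.exp(diff / A))

for d in (200, 400, 600):
    # For large gaps upset_prob(d) ~ exp(-d / A), so the ratio between successive
    # rows approaches exp(-200 / A), roughly 0.16 with this A.
    print(d, upset_prob(d))
```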

It would still be interesting to see how well Bradley-Terry captures empirical tails, and for that filtering would be important. I agree with your approach for filtering.

PS: Implementing the rating algorithm, I could now plot the average rating evolution of a player who plays against 2000 Gor rated opponents and wins 70% of the time. This gives an indication of the number of A-rated tournament games against similarly rated opponents needed for a rating to converge.
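A sketch of that convergence experiment, using the same simplified fixed-con update with assumed con and a values (the real EGF parameters are rating-dependent, so the exact trajectory will differ):

```python
import math, random

random.seed(3)
CON = 21.0           # assumed fixed step size near 2000 Gor
A = 110.0            # assumed winrate scale parameter
TRUE_WINRATE = 0.7   # the player scores 70% against 2000 Gor opponents
TRIALS, GAMES = 2000, 60

def expected_score(diff):
    """Predicted winrate for a player rated `diff` Gor above the opponent."""
    return 1.0 / (1.0 + math.exp(-diff / A))

# Average the rating trajectory over many simulated players, all starting at 2000
# and always playing opponents fixed at 2000 Gor.
avg = [0.0] * (GAMES + 1)
for _ in range(TRIALS):
    rating = 2000.0
    avg[0] += rating
    for g in range(1, GAMES + 1):
        result = 1.0 if random.random() < TRUE_WINRATE else 0.0
        rating += CON * (result - expected_score(rating - 2000.0))
        avg[g] += rating
avg = [x / TRIALS for x in avg]

print(avg[20], avg[60])  # average rating after 20 and after 60 games
```

With these assumed parameters the average rating climbs quickly at first and then levels off towards the fixed point where the predicted winrate equals the actual 70%.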



That’s a nice simulation.
So a player who enters his first tournament as 1k (setting his initial rating to 2000 GoR) and consistently scores 70% against players with 2000 GoR (and never plays against players with a different rating) needs about 20 tournament games (4 serious weekend tournaments) to cross the boundary between 1k and 1d.
That seems reasonable to me.

The convergence speed to 2050 GoR may not even be all that different in the old system. The old system would expect a higher winrate for a 1d against a 1k, so it would be less impressed by that 70% winrate (giving fewer rating points). But it also applies a greater con value, so it tends to give more rating points for wins.

But the old system would converge to a lower rating in the end, because it is less impressed by that 70% winrate against 1k.