Well, to me it seems fairly standard from the point of view of “this is roughly how far the results spread from the expected score on average”.
The problem with “correctly labelled / all samples” is that it doesn’t account for the fact that the system isn’t trying to maximise correctly labelled samples; it’s trying to give each player an estimated rank that’s as close to that player’s true rank as possible. Because of this, a major upset (such as a 13k beating a 1d) should count as a bigger source of error than a minor upset (such as a weak 13k beating a strong 13k): if the 1d was expected to score, say, 0.99 in that game, the loss should contribute far more error than losing a game where the expectation was only 0.55. I’m not 100% behind the measure that they used, but it’s not completely worthless.
Now my gut (naive) reaction would be to do something like this: for each win add (1 − Expectation), and for each loss add Expectation (and then probably divide by the number of data points to get a comparable number); that way you’re capturing the error of each game. A rough sketch of that idea is below.
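For concreteness, here’s a minimal Python sketch of that naive per-game error; the function name and the example games/expectations are made up purely for illustration.

```python
def naive_mean_error(games):
    """Average per-game error: for a win add (1 - E), for a loss add E.

    `games` is a list of (outcome, expectation) pairs, where outcome is
    1.0 for a win and 0.0 for a loss (from the rated player's side), and
    expectation is the rating system's predicted score for that player.
    For binary outcomes this is just the mean of |outcome - expectation|.
    """
    total = sum((1.0 - e) if y == 1.0 else e for y, e in games)
    return total / len(games)

# Hypothetical data: a big upset (expected 0.99, lost) contributes far
# more error than a near coin-flip loss (expected 0.55, lost).
games = [(1.0, 0.75), (0.0, 0.55), (0.0, 0.99)]
print(naive_mean_error(games))  # ~0.60
```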
However, this is a problem that has been considered by people who build rating systems, notably in a competition held by FIDE (Deloitte/FIDE Chess Rating Challenge | Kaggle). Glickman himself not only competed in it (and came in fifth with his Glicko-Boost algorithm) but, as chairman of the US Chess ratings committee, also suggested the measure by which entries would be judged, which is described at the link above and quoted below:
The evaluation function used for scoring submissions will be the “Binomial Deviance” (or “Log Likelihood”) statistic suggested by Mark Glickman, namely the mean of:
-[Y LOG10(E) + (1-Y) LOG10(1-E)]
per game, where Y is the game outcome (0.0 for a black win, 0.5 for a draw, or 1.0 for a white win) and E is the expected/predicted score for White, and LOG10() is the base-10 logarithm function.
There isn’t a full explanation given for its use, but I would imagine its resemblance to the Binary Entropy function is the main argument for it: it measures how surprising a result is given the predicted probability of a binary outcome.
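To make the quoted metric concrete, here is a small Python sketch of the mean binomial deviance following the formula above; the function name, the epsilon clipping, and the example numbers are my own illustrative assumptions, not part of the Kaggle specification.

```python
import math

def binomial_deviance(outcomes, expectations, eps=1e-15):
    """Mean of -[Y*log10(E) + (1-Y)*log10(1-E)] over all games.

    outcomes: list of Y values (0.0 black win, 0.5 draw, 1.0 white win).
    expectations: list of E values, White's predicted score per game.
    eps clips E away from 0 and 1 so the logarithm stays finite.
    """
    total = 0.0
    for y, e in zip(outcomes, expectations):
        e = min(max(e, eps), 1.0 - eps)
        total -= y * math.log10(e) + (1.0 - y) * math.log10(1.0 - e)
    return total / len(outcomes)

# Hypothetical games: a confident correct prediction costs little, while
# a big upset (E = 0.95 for White but Black wins) is penalised heavily.
print(binomial_deviance([1.0, 0.5, 0.0], [0.80, 0.50, 0.95]))  # ~0.566
```

As with the log-likelihood framing, a confident wrong prediction blows up the score, which is exactly the “major upset should hurt more” behaviour discussed above.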