So, the first thing I think we need to discuss:
We need to make sure we’re actually testing our current rating system and not a “ghost” of it.
Here’s the thing: until we get official confirmation that the goratings repository is the code currently running on the site, we don’t actually know exactly how the live system works; and even with that confirmation, until we understand the code thoroughly, we won’t know whether it actually behaves the way Glicko-2 is specified to.
If anything, we already have some evidence that it doesn’t. (see here, then here, then the PS here)
While this isn’t necessarily a big issue for the rating system itself, since it serves the main practical purpose of a rating system (matchmaking) well enough, I believe it might be an issue for our purpose of testing it:
Glicko-2 is based on a specific probabilistic model, built around predicting win probabilities from a player’s parameters and then updating those parameters according to how far off those predictions turn out to be. So Glicko-2 tries to “converge” to a rating that makes its win% predictions as good as possible.
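To make that concrete, this is the prediction at the heart of the model. A minimal sketch of the expected-score formula from Glickman’s Glicko-2 paper, with all values on the internal Glicko-2 scale rather than the displayed 1500-centred ratings:

```python
import math

def g(phi: float) -> float:
    # Glicko-2 g() factor: shrinks the impact of the rating difference
    # when the opponent's rating deviation (phi) is large.
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(mu: float, mu_opp: float, phi_opp: float) -> float:
    # E from the Glicko-2 paper: the predicted probability that a player
    # with rating mu beats an opponent with rating mu_opp and rating
    # deviation phi_opp (everything on the internal Glicko-2 scale).
    return 1.0 / (1.0 + math.exp(-g(phi_opp) * (mu - mu_opp)))
```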
Thus, for Glicko-2 specifically, it may make sense, for example, to test different “fine-tunings” of it and see which one predicts results most accurately, because by doing that you’re measuring how good each tuning is at converging to that supposedly optimal point that the probabilistic model imagines and, well, models.
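By “accuracy” I mean something like: have each tuning produce a pre-game win probability for every game, then score those predictions against the actual results with a proper scoring rule. A minimal sketch of one such score (log-loss); the choice of metric here is my suggestion, not anything taken from goratings:

```python
import math

def log_loss(predictions: list[float], outcomes: list[int]) -> float:
    # Mean negative log-likelihood of the observed results; lower means
    # the predicted win probabilities matched the outcomes better.
    # outcomes[i] is 1 if the predicted player won, 0 otherwise.
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, won in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += won * math.log(p) + (1 - won) * math.log(1.0 - p)
    return -total / len(predictions)
```

The tuning with the lowest log-loss (or Brier score, or whatever metric we agree on) over the same set of games would be the one that converges best, in the sense above.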
But if you start changing the structure of the system itself, the Glicko-2 probabilistic model doesn’t necessarily apply anymore.
So if we follow the Glicko-2 specification to calculate the expected win probability for a match, but plug in OGS ratings instead, that could be a pointless exercise, since the resulting number might be meaningless for the OGS rating system.
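To be clear about what I mean by “plugging in OGS ratings”: the by-the-book calculation would be to convert the displayed rating and deviation to the internal Glicko-2 scale and feed them into `expected_score` from the sketch above. A hedged sketch, assuming OGS ratings follow the usual 1500-centred Glicko convention (which is exactly the kind of thing we’d need to confirm from the goratings code); the example numbers are made up:

```python
GLICKO2_SCALE = 173.7178  # conversion constant from the Glicko-2 paper

def to_glicko2_scale(rating: float, deviation: float) -> tuple[float, float]:
    # Standard Glicko-2 conversion from a 1500-centred rating/deviation
    # pair to the internal (mu, phi) scale. Whether OGS ratings really
    # live on this scale is an assumption to be checked against goratings.
    return (rating - 1500.0) / GLICKO2_SCALE, deviation / GLICKO2_SCALE

# Hypothetical ratings, for illustration only:
mu_a, _ = to_glicko2_scale(1850.0, 65.0)
mu_b, phi_b = to_glicko2_scale(1720.0, 80.0)
p_a_beats_b = expected_score(mu_a, mu_b, phi_b)  # the number whose meaning is in question
```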
I think I might be exaggerating here, as the OGS system clearly works in a way that’s pretty similar to how Glicko-2 should work. So it must be doing something in the right ballpark, at least.
But for testing purposes there’s at least one drawback to this, I think: we might inadvertently end up “strawmanning” the system, because the Glicko-2 equation for expected win probability may be a slightly inaccurate approximation of the number we actually want, which should intuitively come from a probabilistic model fitted to the OGS system instead.
We want to test the system, but our hypothesis is that the system is bad, so we should probably give it the best possible chance to prove us wrong.
(By some cosmic coincidence we might also end up “steelmanning” the system instead of “strawmanning” it; I don’t think that’s worth worrying about, but we should at least try to be as accurate as we can.)