Alternative rating system to goratings.org

Sure, but this doesn’t solve the problem of bias: a rating system is supposed to answer the question of who is stronger and who is weaker. It should be the thing that decides your opinion of who would win in a matchup, instead of the other way around, where your opinion of who would win in a matchup decides what the rating should be.

Belonging to the Japanese system could give you an opinion, but it should not be relevant to a discussion about the accuracy of a rating system: the system itself should first be shown to be accurate, and only then does it become meaningful to state the opinion that, e.g., Ichiriki Ryo is not a top 20 player.

In this whole thread the OP is doing it the other way around: claiming the rating system is accurate because it rates Ichiriki Ryo below the top 20, instead of claiming that the rating system is accurate and that therefore Ichiriki Ryo is not a top 20 player.


Furthermore, I have even stronger objections when the bias itself seems unreasonable: when the opinion is formed not on results in matchups, but on which system the player is part of and on a lack of international titles.

I mean, I can be easily convinced that Ichiriki Ryo should indeed be ranked lower, but the arguments for it found in this thread aren’t convincing at all.

7 Likes

I still think race has nothing to do with this until it was brought up by @BHydden; maybe we could talk about (anti-)nationalism instead.
Now, if you take such a broad definition, you are going to lose the sense of the word. I am OK with the feeling that the Japanese go associations and their pros are not doing that well on the international scene yet; it’s not a secret, and I know they did well before too. It’s not genetic.
And if you go strict, you are going to have to suppress a lot of things: national rooms and flags on OGS, soccer aficionados… If something bothers me a lot, it is nationalism.

1 Like

This is a bias too. The algorithm is the algorithm. Judge that, and not the motives or the person that made it.

In the end this will be a hot topic until Japanese players actually start participating more actively in China A league and every other international event.

You can’t create an ideal rating with so few games.

2 Likes

Also, he really hasn’t said it is accurate (let alone more accurate); his first post after the OP said that he is looking for resources to test them both. Posters jumped on him pretty quickly, asking for evidence and support for a claim he hadn’t made, and his reply was a reasonable one:

There isn’t similar support for WHR when used on professional players (Remi only wrote about its predictive capabilities with KGS games). Keep your nitpicking hat on and use the same skepticism towards both sides to provide more constructive feedback.

Instead there is a lot of reading into motives and jumping to conclusions…

1 Like

I am not familiar with logistic regression. Can you (or someone) expand this formula so that it becomes explicit for those who don’t know what logloss or logit(S_x-S_y) means?

Also, you say you use machine learning to make the computation. Does it mean you have a minimization problem, and you find an approximation of the solution? How accurate is the approximation?

The algorithm was chosen specifically to exclude. He outright stated multiple times that his motivation for finding a new ranking system was to exclude a player based on their nationality.

His response was “he gets beaten by many Chinese players”, and he refused to comment on Korean or other nationalities for some reason. jlt pointed out that his original claim is incorrect: 7 wins out of 13 games can hardly be called “beaten by many Chinese players”. xiaodaiboy refused to comment on this. It wasn’t until jlt questioned him again that he responded, and he only said “you do your analysis, I’ll do mine”, which is avoiding the point yet again.

" My ratings are based on last 365 days of games, which I deem to be reflective of a player’s strength" is not justified at all. I can simply say “goratings is based on a dynamic Bradley-Terry model that directly computes the exact maximum a posteriori over the whole rating history of all players, which I deem to be reflective of a player’s strength. so yeah” and have just as much, if not more, validation in using WHR over his method.

There really isn’t. We’ve more or less probed him enough to remove the other options. His repeated statement of “he is Japanese so he can’t be in the top” is grounds enough to hint at racism. He has made no attempt to prove that his method is more reliable than WHR, and has only relied on “a Japanese pro can’t be top 10”.

I don’t think BHydden is wrong to bring this to light. It can be informative even if BHydden is wrong in his claim, as xiaodaiboy can take the opportunity to state otherwise and ensure we focus on the correct thing: the algorithm in question. xiaodaiboy should be capable of defending himself against a claim of racism.

His statement of “Ichiriki’s gorating was an observation that didn’t fit reality (my reality might be biased, but my modelling method is objective)” is provably false since his claim of “my modelling method is objective” is provably false. Therefore the “my reality might be biased” statement is what remains. So we ask “in what way might his reality be biased?”

Based on his own admissions, his reasoning for creating a new rating system was specifically to demote a Japanese player.

I think after all of the deliberations so far, “racial profiling” is certainly a line of questioning that can be pursued with justification. It may not lead to anything, but it is at least worth addressing.

2 Likes

game_result_x_y_g = 0 or 1 (the result of game g between players x and y; 1 if x won)
S_x is the strength of player x, in the range of -infinity to infinity, which is unknown and needs to be derived by optimizing the formula below.

The logistic regression that I set up optimizes over this:

sum over g of logloss(logit(S_x - S_y), game_result_x_y_g)

where logloss(a, b) = -(b*log(a) + (1-b)*log(1-a)), which for those who know the theory is just -log(likelihood of the game result given the strengths S_x and S_y of players x and y),

and logit(x) = exp(x)/(1+exp(x)) (despite the name, this is actually the logistic/sigmoid function; the true logit is its inverse).

In one year there are roughly 5000 games, so we just optimize that sum above over all g (more than 5000 games).

Anyway, you don’t need to know these formulas. Basically, that’s the set-up of a logistic regression, which is one of the most popular methods for modelling the probability of binary outcomes (win/lose).

I added some other terms to estimate the strength of the white advantage in 7.5-komi games, but the basic setup is the same.
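For those who’d like to see this concretely, here is a minimal sketch of the same setup in Python (my actual code is in R on GitHub; the players and results below are made up for illustration, and the komi term is omitted):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up players and results, purely for illustration.
players = ["Shin Jinseo", "Ichiriki Ryo", "Gu Zihao"]
idx = {p: i for i, p in enumerate(players)}
# Each game g: (player_x, player_y, game_result_x_y_g), 1 if x won, 0 if y won.
games = [
    ("Shin Jinseo", "Ichiriki Ryo", 1),
    ("Shin Jinseo", "Ichiriki Ryo", 0),
    ("Shin Jinseo", "Ichiriki Ryo", 1),
    ("Ichiriki Ryo", "Gu Zihao", 1),
    ("Ichiriki Ryo", "Gu Zihao", 0),
    ("Gu Zihao", "Shin Jinseo", 0),
]

def sigmoid(x):
    # exp(x)/(1+exp(x)); the "logit" in the formulas above.
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_likelihood(S):
    # sum over g of logloss(sigmoid(S_x - S_y), game_result_x_y_g)
    total = 0.0
    for x, y, result in games:
        p = sigmoid(S[idx[x]] - S[idx[y]])  # modelled P(x beats y)
        total -= result * np.log(p) + (1 - result) * np.log(1 - p)
    return total

# The loss depends only on strength differences, so centre the solution.
res = minimize(neg_log_likelihood, np.zeros(len(players)))
strengths = dict(zip(players, res.x - res.x.mean()))
print(strengths)
```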

Anyway, logistic regression is not the be-all and end-all. Ultimately, we just need to test whether this method is more accurate on average than goratings. This study can be done.

Theoretically, any numerical solution is “an approximation”, so yeah. How do I explain this? Basically, the algorithms are designed to run until the approximation is “close enough” by some definition of close enough. For most practical applications, leaving that “close enough” setting at the default is good enough, so I didn’t change the default setting in the optimization algorithm.

So how close are your numerical values to the solution of the minimization problem? 1 rating point? More? Less?

Shouldn’t you already be able to do that with past data? Like using 2019-2020 data in your case, compared with goratings’ historical data, then using this year’s records to test how they fare? (And the same going back decades, even adding a third system for comparison.)

P.S. The author specifically states it uses Glicko-2.

1 Like

Eh… like… hmmm, how do I say this? Basically, the algorithm I use is fairly well tested and should come close: much closer than 1 point of difference. I think you just need more background knowledge on logistic regression, numerical optimization, and the defaults as implemented in R. It’s too hard to explain a uni-level course here, but my code is open on GitHub, so those who are interested can check it out. I am happy to answer questions on GitHub if needed.
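If you want to check for yourself, one simple sanity check (reusing `neg_log_likelihood` and `players` from the Python sketch above) is to refit with a much stricter tolerance and see how much the strengths move:

```python
# Refit with a much tighter stopping tolerance; if the default was already
# "close enough", the fitted strengths should barely move.
res_default = minimize(neg_log_likelihood, np.zeros(len(players)))
res_tight = minimize(neg_log_likelihood, np.zeros(len(players)), tol=1e-12)
print(np.max(np.abs(res_default.x - res_tight.x)))  # expect a tiny number
```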

I think I’ve mentioned it can be done. It just takes effort.

I just read about Glicko. It seems like the innovation is the introduction of an estimate of the volatility, which I haven’t done. Either way, my ratings corroborate this Glicko-2-derived rating, which is very different from the goratings rating.

So people can make up their own minds about which ones to look at. Anyway, ratings are just a bit of fun. Iyama beat Ding Hao recently, so anything is possible. Rina also beat Ichiriki recently, so nothing is set in stone.

A proper study of the accuracies of the ratings would be interesting though.

Maybe I’ll just put up a crowdfunding campaign and people can donate to it. I reckon I could do the analysis and publish the result for about USD 5000.

Isn’t it just a matter of putting different data in and letting the computer do the rest of the work? You could even just generate a historical list with your method; there are enough programmers on the forum who would be able to help compare them.

Come apply for a data scientist role in my team. I’d love to have you on the team.

I actually more or less understand your method, and it has a base assumption that a player’s true strength is static and time-independent across the games they played within that year. I think your method is actually a special case of decayed-history rating, where all games within the year have weight 1, and all earlier games have weight 0.
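To illustrate, a sketch of the two weighting schemes (the exponential form and the half-life below are illustrative assumptions, not anyone’s actual parameters):

```python
def window_weight(age_days, window_days=365):
    # The 365-day scheme: full weight inside the window, zero before it.
    return 1.0 if 0 <= age_days < window_days else 0.0

def decayed_weight(age_days, half_life_days=180):
    # A generic decayed-history scheme: smooth exponential decay with age.
    return 0.5 ** (age_days / half_life_days) if age_days >= 0 else 0.0
```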

2 Likes

Don’t most rating systems use logistic regression, including Glicko-2, EGF rating system, WHR and plain old Elo rating systems (Elo as a statistical learning model | Steven Morse)?

I think the differences between these systems come down to selecting different data sets, different regression windows, and different update factors and update criteria. The programmer can choose those to get a system they consider better by some performance metrics they consider important, maybe at the price of higher computation costs than plain old Elo.

It would be good to know which system performance metrics you were trying to optimize and how your choices contributed to improving them.
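To make the connection concrete: as the linked Steven Morse post explains, the classic Elo update can be read as a stochastic-gradient step on that same logistic loss (up to a constant scale factor). A minimal sketch:

```python
def expected_score(r_a, r_b):
    # Standard Elo expectation: a logistic curve in base 10 with scale 400.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome_a, k=20):
    # One Elo update = one gradient step on this game's logloss,
    # with the K-factor playing the role of the learning rate.
    e = expected_score(r_a, r_b)
    return r_a + k * (outcome_a - e), r_b - k * (outcome_a - e)
```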

3 Likes

You can think of it like that.

For this kind of model it’s common to validate using logloss on data unseen by the model. Just take the rating from just before a game happens, and check how close the prediction is to the game result. Do this for many games and report the average.
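A sketch of that validation loop in Python (the `rating_before` helper is hypothetical; it stands for whatever returns a player’s rating fitted only on games strictly before the given date):

```python
import math

def logloss(p, outcome, eps=1e-12):
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

def average_holdout_logloss(games, rating_before):
    # games: iterable of (player_x, player_y, outcome_for_x, date).
    # rating_before(player, date): hypothetical helper returning the rating
    # fitted on data strictly before `date`.
    total, n = 0.0, 0
    for x, y, outcome, when in games:
        diff = rating_before(x, when) - rating_before(y, when)
        p = 1.0 / (1.0 + math.exp(-diff))  # predicted P(x wins)
        total += logloss(p, outcome)
        n += 1
    return total / n
```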

Yes, but most rating systems work by updating ratings to improve their predictions for future game results. So did you measure that your system makes better predictions than WHR when using the same data set? If yes, how much better is it?

3 Likes

There are differences. For instance, if you don’t play, your EGF rating doesn’t change, but it may change in other rating systems.

Also the EGF rating depends on the order in which the matches are played.

1 Like

Yes, in the EGF rating system, player ratings are only updated after participation in a tournament (and sometimes right before, when the player gets a rating reset or when a player is new to the system). The order of results within a tournament doesn’t matter, but the order of tournaments matters.

But I consider such things to be more like “configurable” elements (rating update criteria, regression window selection, various weights) surrounding the core system than inherent properties of the core system (the rating update formulae plus the iterative process, if any).

2 Likes

Iyama Yuta rises to #25 in my ranking due to recent strong performances against top Korean and Chinese players.

https://daizj.net/baduk-go-weiqi-ratings/

1 Like