Site Ratings Period


#1

Hello,

From the docs people helpfully linked: http://www.glicko.net/glicko/glicko2.pdf

The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period. The length of time for a rating period is at the discretion of the administrator.

Following on my previous post, I was hoping that someone could say what the ratings period for OGS is. (Strangely, details of the Glicko math used were entirely absent, despite the fact everyone knew the buzzword.)

Your score updates each game, but a ratings period of a single game is a major violation of the Glicko math.

So I’m wondering:

  1. How are we updating scores each game?
  2. What is the Glicko ratings period?
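
For reference, a single Glicko-2 rating-period update, following the steps in the linked paper, can be sketched as below. This is a minimal sketch and not OGS's code: the function names (`update`, `new_volatility`) are made up for illustration, and tau = 0.5 and the three opponents are taken from the paper's worked example. It runs the same three results once as one rating period and once as three one-game periods, which is the difference question 1 is asking about.

```python
import math

SCALE = 173.7178  # conversion factor between the Glicko and Glicko-2 scales


def g(phi):
    """Down-weights a result according to the opponent's rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    """Expected score against an opponent with rating mu_j and deviation phi_j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def new_volatility(phi, v, delta, sigma, tau, eps=1e-6):
    """Iterative volatility update (step 5 of the paper)."""
    a = math.log(sigma ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        den = 2.0 * (phi ** 2 + v + ex) ** 2
        return num / den - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    return math.exp(A / 2.0)


def update(rating, rd, sigma, games, tau=0.5):
    """One rating period. games is a list of (opp_rating, opp_rd, score)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not games:
        # No games in the period: only the deviation grows.
        return rating, SCALE * math.sqrt(phi ** 2 + sigma ** 2), sigma
    v_inv = 0.0
    total = 0.0
    for opp_rating, opp_rd, score in games:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        e = expected(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1.0 - e)
        total += g(phi_j) * (score - e)
    v = 1.0 / v_inv
    delta = v * total
    sigma_new = new_volatility(phi, v, delta, sigma, tau)
    phi_star = math.sqrt(phi ** 2 + sigma_new ** 2)
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * total
    return SCALE * mu_new + 1500.0, SCALE * phi_new, sigma_new


# Worked example from the paper: one win, two losses.
games = [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]

# All three games treated as one rating period.
batch = update(1500, 200, 0.06, games)

# The same games treated as three one-game rating periods.
seq = (1500, 200, 0.06)
for game in games:
    seq = update(*seq, [game])

print("one period of 3 games:", batch)
print("three periods of 1 game:", seq)
```

The batch call reproduces the paper's worked example (rating about 1464.06, RD about 151.52, volatility about 0.05999). The per-game sequence generally lands on different numbers, because each game is scored against an already-updated rating and deviation.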

#2

I don’t have any proof, I’m just recalling (possibly erroneously) what I have read. My understanding is that

  • Our ranks are updated with a provisional result each game

  • The ratings period is 3-4 weeks (can’t recall precisely).


#3

In my case I’m playing correspondence games only and as these take a long time to finish, sometimes my game record goes unaltered for weeks. How does this affect the calculations when there are only a few (if any) games every 3-4 weeks available in the records?


#4

  1. When a ranked game ends, both you and your opponent get their new rating calculated based on their tally.
  2. 15 games or 30 days.

Source: OGS has a new Glicko-2 based rating system!
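
The "15 games or 30 days" window could be sketched roughly as follows. The post does not say how the two limits combine, so this is a guess: the hypothetical `rating_window` helper takes a `rule` argument so both readings can be compared, and the default (`"longer"`, whichever reaches further back) is an assumption, not confirmed OGS behaviour. The dates are arbitrary.

```python
from datetime import datetime, timedelta


def rating_window(game_times, now, max_games=15, max_days=30, rule="longer"):
    """Return the game timestamps that fall inside the rating window.

    game_times must be sorted oldest to newest.
    rule="longer"  -> whichever limit reaches further back (assumed default)
    rule="shorter" -> whichever limit reaches less far back
    """
    cutoff = now - timedelta(days=max_days)
    by_days = [t for t in game_times if t >= cutoff]
    by_count = game_times[-max_games:]
    # Both candidates are suffixes of the sorted list, so the one with
    # more games is the one that reaches further back in time.
    if rule == "longer":
        return by_days if len(by_days) >= len(by_count) else by_count
    return by_days if len(by_days) <= len(by_count) else by_count


now = datetime(2018, 1, 31)

# A correspondence player: three finished games in the last 80 days.
slow = [now - timedelta(days=d) for d in (80, 50, 10)]

# A live player: forty finished games in the last 40 hours.
fast = [now - timedelta(hours=h) for h in range(40, 0, -1)]

for name, times in (("slow", slow), ("fast", fast)):
    for rule in ("longer", "shorter"):
        print(name, rule, len(rating_window(times, now, rule=rule)))
```

Under the "longer" reading, a correspondence player with only a few finished games per month would still have up to 15 games in the window, which bears on the question in #3.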

#5

Using this technique consistently produced better results in our experiments, and we later found that Prof. Glickman applied some techniques akin to this in his Glicko-Boost work in 2010, which made us feel better about doing this as opposed to simply using the original ratings.

Reading that, it sounds like we actually use a “folk math” version of Glicko, and not any of the official Glicko methods – exactly like I’d suggested (and been rudely shot down about).

I’m very curious how they tested this, because it sounds like that testing is the only thing underpinning the entire ratings system of the website, which is a “folk math” system modified from a real one by the admins of this website, and not any sort of standard rating system at all.

I also think people need to be more honest about this: We didn’t implement Glicko or Glicko2, we implemented OGSGlicko, which has none of the theoretical support those other two did. In light of this, the responses I got questioning the ratings system were really inappropriate.

You guys made “folk math”, and it works badly.

Could one of the admins chime in on how they tested their customized Glicko method?

Strangely, despite having a section called “implementation notes for the curious” you left out the key detail of your implementation. That doesn’t instill confidence.


#6

Seriously, could any of the admins chime in on how this was tested?

You guys have made some really bold claims and based the entire site and community on that test.

Share it with us.


#7

The wording of your question pretty much guarantees not getting an answer.

“This system works badly, so tell us what testing you did”.

If OGS admins respond to that by elaborating what the testing was, they would be basically agreeing it works badly.

That is far from established.

In fact, other than asserting it is the case, and going on to call it names, you haven’t described what it is that you think works badly.

The description of what was implemented is quite honest. The section you quoted in fact starts with “Grouping as envisioned by Prof. Glickman essentially works by considering many games at once, which for in-person tournaments is not a problem. For online play where we have a lot of ad-hoc games, we needed to get a little creative”, which should tell you that the devs know how Glicko works, recognised that the online play environment doesn’t entirely match, and applied a custom solution.

I think you’d have better luck in your quest to explore how it works at OGS and shine light on potential improvements if you focus more energy on calmly describing the perceived problem, and less on dissing the place and the people who run it.

GaJ


#8

Not at all.

It would show that they did the responsible thing by performing reasonable tests to validate their modifications to the rating system.

Or that they didn’t.

But what’s clear is that the community doesn’t really understand they aren’t using off the shelf Glicko, and thus don’t actually have an underpinning to the rating system here except that testing.

If that’s the only thing validating their choices, they’re absolutely obligated to share it with the community.

“Just trust us, we’re experts!” isn’t very convincing, especially because many members of the community feel there are issues. (Hint: Maybe some of the people who are complaining are experts on the stability of these kinds of systems and think the math looks funky.)

It tells me that the devs implemented a custom solution, but that doesn’t mean I should just trust them that it works, particularly in the face of empirical evidence that it doesn’t. It most assuredly does not require that people understand the mathematics in order for them to type in code, which is all that seems to have happened here.

In fact, that they keep saying they implemented Glicko when they didn’t implement Glicko smells of dishonesty.

That they’re unwilling to say how they tested their modifications is a great reason to doubt their expertise on the topic – they don’t want to say because they know it’s a serious breach of the theory they slapped together half-assedly, without any solid proof that it’s stable.

That’s irresponsible leadership. Period.

There’s absolutely no way to do this, since they wrote a special-snowflake system based on an actual one, then refuse to actually discuss how it’s implemented or tested.

You’re being dishonest – I did exactly that in describing the problem originally, then people went “lulz, we use Glicko” which is untrue, the site does not use Glicko, and the admins are simply silent on the underpinnings of their actual methods.

How, exactly, do you expect me to refute their magic “we tested it!” when the admins won’t discuss how they tested their bespoke not-Glicko rating system?

I’ve already dug into it substantially: enough to realize that the modifications they made and the continuous ratings probably aren’t stable, and can cause the sorts of dynamic problems people have been calling out around 13-kyu/1500. But there really isn’t a way forward without seeing their work on whether it actually is stable. (And the degree to which they analyzed that before rolling out a new ratings system.)


#9

@a20170527 Can you elaborate on what you think the actual problem is? I haven’t seen anything in this thread that constitutes a legitimate problem with the rating system.

Your assertion here (at least, how I read it) is that the ratings are updated too frequently for it to be a direct expression of canonical Glicko2. Do you have any reason to believe that the ratings are fundamentally calculated incorrectly? Maybe some ratings numbers to show where long term ratings diverge between OGS’ ratings and a direct Glicko2 implementation? And then an analysis of how that is hurting the community at large?

In the absence of an actual problem, it’s not feasible for the staff (when I say staff, I mean the one developer we have) to divert attention from verified bugs.


#10

Yes.

The fast update rate causes “eddying” and you’re ratcheting eddies into place, rather than capturing long term behavior around the entry point of the system – only gaining confidence that people are caught in the eddy, not what their skill is.

Eventually they pop out the other side, but the ride is needlessly turbulent and that’s disrupting the flow of lower players into stronger players.

Your question is nonsense or dishonest: the matchings would be different if you utilized actual Glicko, hence there’s no way to perform the experiment that you propose, since I can’t do Glicko pairings of the same people who are exposed to your non-Glicko system to compare.

It should be clear how it’s hurting the community though, based on the feedback: you have too much turbulence around 1500, because of the rapid score updating. That’s why Glicko has a round size – it smooths a lot of bad behaviors out by averaging the impulses each update.

However, I’d be happy to perform analysis on your guys’ records DB to look for exactly the turbulence that I’m calling out. It sounds like you don’t have the time – and I’m more than happy to give you actual contact details (not in public) before you share access.


#11

You could have just said no. Thanks for your response. We appreciate feedback.


#12

That this is your attitude when someone points out actual issues with your implementation (the math part is based on your implementation notes, and is simply a math fact) and offers to help you fix it speaks to why people should question your not-Glicko rating system.

This attitude is bad leadership, and I suspect it reaches beyond the comments here to how your developer implements code.


#13

You haven’t pointed out an actual problem.


#14

You’re simply being dishonest now.

You’re ignoring that I pointed out deep problems with the theory of what you did, and you’re refusing to give me access to the data which would show that problem is occurring.


#15

Okay. Have a nice day.


#16

You too!

I hope the community does well and you eventually meaningfully revisit not-Glicko.


#17

You understand that 13kyu is the starting point for all new accounts, right? That means people who are better or worse than 13kyu will be marked as that rank for their first few games.

So it would make sense that the 13kyu rank area would have a wide gap in skill if you were playing only provisional players. It would be the same if our accounts’ starting rank were 30kyu: there would be a wide skill gap there.

The good players will rise in rank fairly quickly, and the bad ones will drop, most likely before their provisional period ends.

Edit: This isn’t an attack but just an honest question that I am going to ask you, why do you care so much about the rating system even when you are 21kyu? I guess it wouldn’t matter what rank you are but it just seems weird to me.


#18

You’re failing to consider the dynamics of someone who starts as 17-kyu and is working towards 11-kyu.

They’ll start at 13, fall to 17 fairly quickly. They’ll then play a lot of games to rise, giving them a lot more confidence that their rank is correct as they approach 15-kyu. (So far, the system is working as intended.)

The problems begin as they approach 13-kyu again: they encounter a mix of skills that are best described as “randomly distributed between 15-kyu and 10-kyu, with about 10% crap and stronger people”. That causes them to hailstone around, because what happens is that the few games you lose to stronger people knock you back across the 13-kyu gap, which causes you to have to wade through a large number of crap games to approach the 11-kyu exit of the eddy… and if you lose a game around the exit, say against a 9-kyu, you’ll fall back into fighting unranked people and bounce around some more.

The longer you spend in the eddy, the more sure the system is that you belong in that mess, even if the reason you’re still there is you played people who all were outside of it. (This is partially because of the rapid updating.)

If people had a static skill level, this system would be fine – the problem is they don’t, and can shift quite a bit over the span of 6 months. This system is bad at dynamics, because they removed a lot of the things that made Glicko work with dynamics (eg, round size).

That’s my rank after I quit like 30 tournament games when I got fed up with the eddying around 15-10kyu. (It seemed better to just quit if I was done playing – but it murdered my rank, lol.)

Also, I used to be a site supporter (in the monetary sense), because I like go communities. Partly, my concerns started when I was still paying and felt that was a poor way to run things I contributed towards.


#19

Yes, but with provisional ranks this would still be the same even if we started at 30kyu, except the bad players would have nowhere to drop to. You would still end up playing people with an uncertain rank, no matter what the starting point.

It seems like they would be the minority. You would play provisional players less often than people who have played their ranked games and got a “solid rank”. Even if you did play an influx of provisional players, you would still have a coin flip on what their rank is.

All in all I don’t think it would hurt all that much.


#20

This isn’t my objection, even remotely.

There’s two things wrong with it:

  1. It assumes we only can have one entrypoint, which is nonsense.
  2. It assumes that we need to use the highly unstable not-Glicko, which is my main objection.

The reason I don’t agree with you is because neither of those is true.

Except that the entire region is unstable, so there’s no meaningful rank between like 15-kyu and 10-kyu. There are no “non-provisional” players there, because the entrypoint has destabilized rankings in an entire range through constant dynamics not-Glicko handles poorly.

Poor dynamics around the entrypoint create a vortex in the rankings in that band that all players get sucked into.