Problem with handicap in tournaments

shinuito · November 24, 2021, 1:47pm

I don’t think any of the other ratings actually count, or factor into pairings or handicaps, other than the overall rank/rating in the upper left of that table. I think they’re just there as nice summary statistics of how much each player plays each time setting and size, and a rough idea of their rating on each size (although it’s independently calculated as far as I know).

In that case it sounds like the weaker player (0.7 kyu) got 1 handicap stone against the stronger one (0.6 kyu). I don’t know exactly why it was one handicap stone, unless it was something like a 9x9 game, or possibly, if it is correspondence, there was a larger rating gap at the beginning.

shinuito · November 24, 2021, 1:50pm

But tournaments are strange. Maybe it first decides the colour of each player, perhaps trying to balance how many games each player plays as white/black based on some tournament setting, and then calculates the appropriate handicap afterwards? At least if that was the case, it’s better than you somehow having to give them 6 stones.

I have no idea, and I’m not sure where to look, or if I can, to see the tournament code.

Lys · November 24, 2021, 2:00pm

This was actually a bug.
I thought it was solved.

For correspondence games, the rank could change after the beginning of the game. We should compare the ratings as they were at that very moment, and at some point someone did try to do that.

I recently made a small SDK tournament with automatic handicap and it worked well.

shinuito · November 24, 2021, 2:29pm

If it was solved since July then it’s all consistent

Feijoa · November 24, 2021, 3:11pm

Ranks can fluctuate a lot, and yes, what matters is the difference at the beginning of the game. I wrote up this page describing whatever I could figure out from experiments, with some suggestions about how to check ratings:

From what I can see, it has been working as probably intended in most kinds of tournaments, except that you can’t get an even 13x13 game, but @_KoBa’s case could indicate that double elimination is still badly broken. If you have any other specific games that didn’t work right please post a link!

Perhaps we need a new tournament to try it

claire_yang · November 24, 2021, 3:34pm

I checked the historical record. The second player was exactly the same, 0.6k±0.8 (pretty consistent over the years; the rank is 1.1d ± 0.6 overall), while the first player’s rank at the time the game started was 1.5k±0.9, the lowest point in 6 months and in fact the second lowest in the whole record (the lowest being 1.9k±0.9), apart from the initial phase. And just one month ago it was 3.3d±0.7 (the second highest on record, the highest being 3.8d±0.7). That is quite a fluctuation, and I think it’s solely due to blitz games (which is odd: they are only 20 out of nearly 700 games, so why do they contribute so much? All the other breakdown ranks are above dan level, which I can confirm is about the right strength, around 2d).

shinuito · November 24, 2021, 3:47pm

The rating system, I think, is designed to be able to quickly adjust a player’s rank if they’re over/under performing or it’s a new account. So per game it matters whether they win/lose to someone 3 ranks lower or higher, for example, including handicap. The changes will differ in size depending on the games and results. The rating history doesn’t really matter in that case, as it’s a game-by-game calculation using the current ratings to determine how much adjustment to make.

I think that explains the 1 stone handicap though?
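As a rough check of that guess, here is a minimal sketch. It assumes (this is not confirmed from the OGS tournament code) that automatic handicap is simply the rank gap at the start of the game, rounded to the nearest stone. The function name `handicap_stones` and the convention of writing kyu ranks as negative numbers are made up for illustration.

```python
import math

def handicap_stones(rank_a: float, rank_b: float) -> int:
    """Hypothetical rule: rank gap at game start, rounded to the nearest stone.

    Ranks are plain numbers where a larger value is stronger. Kyu ranks are
    written as negatives here (0.6k -> -0.6, 1.5k -> -1.5). This is an
    assumption for illustration, not the site's actual code.
    """
    gap = abs(rank_a - rank_b)
    return math.floor(gap + 0.5)  # round half up

# Ranks as displayed now: 0.7k vs 0.6k
now = handicap_stones(-0.7, -0.6)
# Ranks at game start, per the historical record quoted earlier: 1.5k vs 0.6k
at_start = handicap_stones(-1.5, -0.6)
print(now, at_start)
```

Under that assumption the ranks displayed now would give an even game, while the ranks at game start would give 1 stone, which matches what was observed.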

claire_yang · November 24, 2021, 3:55pm

So it is an unfortunate side effect of the system aggregating different board sizes and time settings into one ranking, where the handicap mechanism only considers the rank at setup, instead of an average or something that accounts for the fluctuation this entails. Objectively, I feel the first player is definitely stronger than the second player (probably one or two stones stronger).

Does this mean the new ranking system is still somewhat broken, or fluctuates too much for certain players (like those who are bad at blitz and time out a lot)? A correspondence game should consider their long-term strength, no?

shinuito · November 24, 2021, 4:22pm

Well I think the system can probably only do what it can with the data it’s presented. I don’t know the players as you do to be able to comment on whether they’re under or over ranked, or whether that’s down to incorporating 9x9 games in with 19x19.

All I know so far is that the effect, while it does impact the overall rating used for matchmaking, doesn’t seem to be too important statistically. What I mean is that last year this question was asked

I think the conclusion was that the overall rank, which incorporates different board sizes, worked just as well as a predictor of who is likely to win a game as using the ratings individually.

I can’t say I thoroughly read through everything or guarantee I understood everything, I’m just highlighting the relevant messages I recall.

Now of course we had a more recent ratings adjustment at the start of the year

2021 Rating and rank adjustments

Summary of the update:

1. We are attempting to align our low dan ranks to be comparable to the EGF and AGA low dan ranks. Currently we are projecting that a 1d OGS rank will be about 0.7 stones weaker than an EGF 1d and about 0.8 stones stronger than an AGA 1d. These numbers will be re-evaluated once the dust settles, but we believe we should end up fairly close to the goal of being within a stone of each system, on average.

2. We fixed a volatility bug that was making ranks jump around more than they should.

3. Because of #2, we removed our sliding windowed rating system, which means things will be a bit more intuitive now - if you win, your rating will go up, if you lose it’ll go down.

4. The rating to rank formula has been updated to be ln(rating / 525) * 23.15. This update retargets our dan range to align roughly with the AGA and EGF ranks, and widens our rank bands a bit, which has a marked improvement on handicap consistency through ranks as well as eliminates the “forever 25k” problem we had.

5. Speed and size specific ratings should be notably more meaningful now and should be more or less on the same scale as the overall ratings.
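To make the rating to rank formula from that summary concrete, here is a small sketch. The constants 525 and 23.15 come from the announcement; the inverse function and the labelling rule (numeric rank 30.0 treated as the 1d boundary, so 29.4 reads as 0.6k and 30.1 as 1.1d) are my assumptions, and the function names are made up.

```python
import math

A = 525.0   # divisor from the announced formula
C = 23.15   # multiplier from the announced formula

def rating_to_rank(rating: float) -> float:
    """Numeric rank from rating: ln(rating / 525) * 23.15."""
    return math.log(rating / A) * C

def rank_to_rating(rank: float) -> float:
    """Algebraic inverse of rating_to_rank (derived here, not quoted)."""
    return A * math.exp(rank / C)

def rank_label(rank: float) -> str:
    """Assumed display rule: numeric rank 30.0 is the kyu/dan boundary."""
    if rank >= 30:
        return f"{rank - 29:.1f}d"
    return f"{30 - rank:.1f}k"

for rating in (1200, 1500, 1800, 2100):
    rank = rating_to_rank(rating)
    print(rating, round(rank, 2), rank_label(rank))
```

Because the formula is logarithmic, one rank corresponds to a fixed rating ratio (a factor of exp(1 / 23.15)) rather than a fixed number of rating points.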

So I also can’t guarantee the same findings still hold. Maybe one could do the same analysis again whenever the time presents itself.

I think if the player isn’t playing at a consistent strength then their rating might be more likely to fluctuate. Again, there are also two settings for viewing the ratings history graph, which might be useful.

The per game view might be useful, if one wants to just see changes per game, and not have it skewed by long periods of no rating change for example. I’m imagining a rating graph that looks steady but it’s because no correspondence games have finished in that period.

I’m not so sure why a correspondence game should consider only the player’s long term (or long time?) strength. If long term, maybe that’s only because the games themselves take a long time? If long time, there’s also no guarantee that people actually spend any more time making moves in correspondence than in a live game. There’s certainly more time between moves, but some people just open a game, pick a move and move on to the next game.

claire_yang · November 24, 2021, 5:09pm

It’s obvious in this particular case. The reason behind the first player’s “rank fluctuation” is mostly timeouts in blitz, which definitely won’t be a factor in correspondence games. Aggregating the blitz rating with the correspondence rating makes no sense in this case. I’m not sure whether the rating mechanism needs to be adjusted for timeouts, or whether the handicap needs to consider some kind of long-term history trend under correspondence settings (or just past correspondence records only).

shinuito · November 24, 2021, 5:38pm

But that’s a completely valid way to lose a game in blitz, no? Are you suggesting they would win all their blitz games otherwise? Or that the losses on time should be annulled? The latter is of course doable when there is a good reason. People also lose correspondence games on time, because they forget about the game, and similarly that could happen to players in a winning position. It’s just the name of the game.

I’m not sure I understand the purpose though? There could still be other factors than the rating system as to why a tournament game might assign an inappropriate handicap.

This is something we could test and request a fix if it’s a bug we find. On the other hand because some player decided they wanted to play a bunch of blitz games some or most of which are losses on time, I don’t think this means the rating system or handicap system is at fault if they’re not paired correctly?

If you want to think of another case, you could have many players who only play live or blitz games, and if they want to start playing correspondence it makes no sense to look at their correspondence ranking. That would certainly perform worse compared to their possibly stable rating for other time settings.

Otherwise I’m not sure what kind of long term history or trend you’d like to consider? Is this just for custom tournaments also? Should that just be a tournament setting for example instead, some custom metric for assigning handicaps?

claire_yang · November 24, 2021, 6:42pm

Maybe for a ranking fluctuation this large, use some kind of long-term average, say from 1 year ago to the present, as a baseline? A player’s strength shouldn’t change by 3 or 4 stones on this kind of timescale in long correspondence games. Setting the handicap from the rank at one specific moment, say the lowest point, when one month later the player is 4 stones stronger, seems a bit wrong to me.
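Something like this is what I have in mind, as a very rough sketch and not anything the site does. `baseline_rank` is a made-up helper that averages whatever rank samples fall inside the window. Ranks are on an assumed numeric scale where 30.0 is the 1d boundary (so 28.5 is 1.5k and 32.3 is 3.3d). The two rank values are the ones mentioned in this thread; the date attached to the 1.5k sample is a placeholder, since I don’t have the exact game start date.

```python
from datetime import date, timedelta
from statistics import mean

def baseline_rank(history, as_of, window_days=365):
    """Mean of the rank samples within window_days before as_of.

    history: iterable of (date, numeric_rank) pairs. Hypothetical helper,
    not part of any real rating or handicap system.
    """
    start = as_of - timedelta(days=window_days)
    samples = [rank for day, rank in history if start <= day <= as_of]
    if not samples:
        raise ValueError("no rank samples inside the window")
    return mean(samples)

today = date(2021, 11, 24)
history = [
    (today - timedelta(days=120), 28.5),  # 1.5k at game start (placeholder date)
    (today - timedelta(days=30), 32.3),   # 3.3d about one month ago
]
print(baseline_rank(history, today))
```

Two samples are obviously not a real history, but even this averages out to low dan rather than 1.5k, which is closer to the player’s actual strength.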

GreenAsJade · November 24, 2021, 10:20pm

Might be? I think it is wonderfully useful, I leave mine set on that setting

Personally, I don’t understand why people think “rating over time” is more useful than “rating over games played”, because of the variability of games-played-per day.

Sorry for interrupting the thread

shinuito · November 24, 2021, 10:44pm

I guess with the rating per game you have no context as to whether this player played 1000 games in 2017 and then none since, or just recently came back to the game after a long break, or is a completely new account. That is by the graph alone, without using say the tooltips which show date when hovering with a mouse/trackpad etc. Even then if that was the information you want then the other graph is convenient.

Not to mention there were several rating system changes, so any ratings pre 2017 vs 2017-2020ish are completely bonkers now.

So even someone who’s played many games pre 2020, their ratings might be completely unreliable/incomparable because of the big rating recalculation. It might say they used to be above 1d when in the older systems they never were. One would have to have in mind what a 1d at that time might mean, but I don’t even have a good grasp of how to translate rating from the various periods of time. Again with hovering over many points of the graph you won’t necessarily get a good context of the time periods of each of the games.

So anyway

I think it’s also great to have the per game graph, but hopefully that might give you some reasons as to why both have their merits.

GreenAsJade · November 24, 2021, 10:55pm

True! Yay that we have both