Is the ranking system actually broken (causing an incorrect bot rank)?

andysif · February 12, 2019, 1:51am

Even its own description says:
"I’m a double-digit kyu bot, playing 20k-10k games. "

My own feeling is that it’s somewhere around the mid-teens, and to be fair, any bot that does this

cannot be anything higher than 15k.

I think the current 8k is a little bit misleading. Could someone manually fix this? (Luckily a bot won’t have any bad feeling over this.)

Eugene · February 12, 2019, 2:17am

FWIW, a number of the more powerful bots are still susceptible to long ladder trickery.

Where people are detected taking advantage of this, corrective action is taken (to the extent possible).

The rank of bots, as I understand it, is determined just like people: by the games that they win or lose.

So it would seem to me that it is not “possible” that the rank of this bot is incorrect.

It is demonstrably correct, within the rule system that we have.

However, in looking at this and other “interesting” bot ranking results, I have started to wonder if there is something out of whack with the ranking system. Maybe a mistaken premise somehow?

I don’t know, but what I can see is this:

This bot has achieved 8k while playing within the rule system
8k is supposed to mean “has a 50/50 chance of beating another 8k player”
This bot has never (that I can see) defeated an SDK in a ranked game.

and also - this bot rarely if ever even plays SDKs.

This is leading me to wonder if the rating system is out of whack in the scenario where a player only ever plays weaker opponents, and plays hundreds and hundreds of them.

I would have thought that the uncertainty factor would take this into account, but it does not appear to be doing so - this bot’s rank uncertainty is surprisingly low.

Hypothesis: Each little gain further boosts the rank, until its predictive power is lost.

Other Hypothesis: The rating system assumes that all players play an approximately even mix of up and down.

Thoughts?

andysif · February 12, 2019, 3:16am

I think this is certainly true. There was a period when I was in a few tournaments and for some reason kept getting opponents ranked higher than me, and I kept losing, and that dropped me about 3k. Not until I had a mix of both higher and lower ranked opponents did I climb my way back.

Another observation that I had with this bot is that it played a lot more 9x9 games than 19x19 games. Maybe a bot is stronger on a small board as there are a lot fewer combinations.

Yet another observation is that it seems a lot of people are just fooling around with bots. They make “curious” moves that are not up to their rankings. Why they do that in a “ranked” game I don’t know, but it seems to be the case.

Eugene · February 12, 2019, 3:21am

Yes - and this contributes to the problem because when one person does that against another, it is a blip in the ocean.

But when scores of people all do that with a single bot, the effect accumulates.

I’m very conscious that this is qualitative speculation about how the rating system works.

I really had the impression that the maths behind the system was designed to handle this sort of thing, but I have only the most cursory understanding of that maths.

I wonder if we should change the title of this thread to “is the ranking system actually broken” in order to attract the ranking system experts to answer this question.

andysif · February 12, 2019, 3:43am

will do

flovo · February 12, 2019, 5:15am

Having a look at the rating history

amybot_ddk wins in about 60% of all her ranked games and 70% of her games against weaker opponents. There are enough losses against weaker opponents to ensure “stability”.

amybot_ddk’s rating fluctuates between 11k and 8k, so 8k is surely too high, but there is nothing out of whack going on there.

There are always players who play almost only weaker players, simply because there is always a strongest player. Every rating system has to deal with this.

If one plays almost only weaker opponents the rating history will look something like this:

There are long winning streaks during which the rating rises, but a few losses will drop you back down.

Edit: I want to add that bot ratings are interesting indeed. The lowest ranked bots have always a really high variation in their rating (the rating varies within 400 points).

ckersch · February 12, 2019, 5:23am

My experience playing lots of bot games last summer was that bots play in a really weird manner compared to humans. Fuego was often better than me at reading local fights, but would also inexplicably self-atari itself and die on a large scale. Overall, though, the quality of play is higher than it seems. The atrocious blunders that no SDK would ever make distract from the fact that something like direction of play, for many bots, is better than their human opponents.

This applies even to strong bots. LZ, even on a decent computer, still sometimes fails to read out ladders, especially ones that extend across the board. I’ve watched it sit at 70% for a variation that had a ladder one move deeper. Despite that, it’s around 9p in terms of overall playing strength, since it’s got excellent direction of play and great reading (aside from ladders).

All of that is to say: the rating system isn’t broken. It’s working as planned, and bots play weird. The 9x9 bit also likely applies: it’s MUCH easier for a bot to read out what’s going on with smaller boards, so playing strength likely differs substantially based on board size. This is, perhaps, one flaw with the rating system: pooling all time controls and all board sizes hides the fact that some players excel at certain boards/speeds of play. Separating these out into different, non-pooled ratings would likely improve the accuracy of the system.

flovo · February 12, 2019, 5:37am

To discuss this we should rather create a new topic.

flovo · February 12, 2019, 5:50am

Some additional notes on the rating system:

Wins against stronger players are worth more than wins against weaker players, and losses against weaker players cost you more points than losses against stronger players.

If a 10k wins against a 15k she gains 7 rating points.
If she loses it will cost her 22 points.
So a win/loss ratio of about 3/1 would lead to a stable rating even if she only plays weaker players.

For a 10k against a 12k it is 10 points for a win and 18 points for a loss, leading to stability at a win/loss ratio of about 2/1.
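The break-even arithmetic above can be checked with a few lines of Python. This is only a sketch: the point values are the ones quoted in this post, and `break_even` is a hypothetical helper name, not anything from OGS.

```python
def break_even(gain_per_win, loss_per_loss):
    """Return (win_rate, wins_per_loss) at which the expected rating change is zero.

    Solves p * gain = (1 - p) * loss for the win rate p.
    """
    win_rate = loss_per_loss / (gain_per_win + loss_per_loss)
    wins_per_loss = loss_per_loss / gain_per_win
    return win_rate, wins_per_loss


# Point values quoted in the post (10k vs 15k, and 10k vs 12k).
for label, gain, loss in [("10k vs 15k", 7, 22), ("10k vs 12k", 10, 18)]:
    rate, ratio = break_even(gain, loss)
    print(f"{label}: break-even win rate {rate:.0%}, about {ratio:.1f} wins per loss")
```

With these numbers the break-even point is roughly 3.1 wins per loss against the 15k and 1.8 wins per loss against the 12k, which is where the rounded 3/1 and 2/1 figures come from.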

I would state the opposite, the uncertainty for a player with a very stable playing strength (bots shouldn’t have good or bad days) should be much lower than for a human player. (just my opinion, I was never able to quantify this in terms of glicko rating).

The uncertainty on OGS has a lower bound at 60 points and is in the range of 90-60 points for perfectly stable players with at least 15 games per 30 days.

Higher uncertainty also means bigger rating changes. The rating gain/loss is proportional to uncertainty squared.
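To see where the "uncertainty squared" behaviour comes from, here is a minimal sketch of a single-game update using the published Glicko-1 formulas. It is an assumption that this resembles what OGS does; the site's own Glicko implementation may differ, so the output is not expected to match the 7 and 22 points quoted above. The function names and sample ratings are made up for illustration.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd):
    """Attenuation factor for an opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected_score(r, r_opp, rd_opp):
    """Expected result (0..1) against an opponent with the given rating and deviation."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))


def glicko1_update(r, rd, r_opp, rd_opp, score):
    """Return (new_rating, new_rd) after one game; score is 1 for a win, 0 for a loss."""
    e = expected_score(r, r_opp, rd_opp)
    d2_inv = Q**2 * g(rd_opp) ** 2 * e * (1 - e)  # this is 1 / d^2
    denom = 1 / rd**2 + d2_inv
    new_r = r + (Q / denom) * g(rd_opp) * (score - e)
    new_rd = math.sqrt(1 / denom)
    return new_r, new_rd


# Illustrative values only: a 1500 player beating a 1300 player.
for rd in (60, 120):
    new_r, _ = glicko1_update(1500, rd, 1300, 60, score=1)
    print(f"RD={rd}: win is worth {new_r - 1500:.1f} points")
```

In this sketch, doubling the deviation from 60 to 120 makes the same win worth nearly four times as many points, because the step size is approximately `Q * rd**2` while `rd` is small compared with `d`. The same function also shows the asymmetry described above: a loss to the weaker player costs more than a win gains.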

Eugene · February 12, 2019, 8:33am

I think a bot might be more prone to good days and bad days as it either bumps into situations where it makes those silly mistakes, or bumps into people whose play style breaks it.

But I was coming from a different direction. I had the impression that your uncertainty increases if you don’t have both wins and losses.

For example, if I start new here, and never have a loss because I always play TPKs, surely my uncertainty should stay high. I thought that to get certainty about your ranking you need it bounded by wins and losses.

In respect of my comment that “amy-bot only plays weaker players” I mean in recent history. Now that amy-bot is 8k, and for at least the most recent few pages of its history, it only plays DDKs, but it is an SDK.

In the last 150 games, only 15% of games are vs stronger opponent, and that person only started playing since this conversation began

flovo · February 12, 2019, 10:19am

This is true. For new players this can prevent them from leaving the provisional rank behind.
For players with long winning streaks, I know that the deviation can be as high as 130, but I don’t know if this is a natural limit or if it could rise to infinity. For amybot this is no problem since it hasn’t had long streaks of ranked wins (the longest I can find is about 15 or so games).

Yes, I know. sorry if I gave the wrong impression.
For the rating system it is not really important if one plays only weaker players, so I hadn’t taken a detailed look at this.

smurph · February 12, 2019, 11:15am

In fact, you can manually fix this by beating the bot in ranked games, over and over again.

txwolf · February 12, 2019, 12:08pm

Played two bots, one by accident, the other out of curiosity. I will never play a bot again. It feels like testing someone’s software for free. Not acceptable as an IT professional.

Eugene · February 12, 2019, 12:13pm

Not testing, Training

siimphh · February 12, 2019, 12:19pm

I can offer some insight into how amybot behaves (AFAICT - I’ve built it and am running her but she is based on machine learning so I fundamentally don’t understand her).

I’m constantly tweaking amybot’s strength for 9x9, 13x13 and 19x19 separately based on the performance over the last few days (currently a 3d window) and try to nudge it to be 50/50 in ~even games (and +/- 5 to the win rate for rank differences of 1-3). Currently I think I have it set to try to be 13k for amybot-ddk but… The auto-tuning might be broken in subtle ways.

As for its fundamental performance on different board sizes, it is certainly much much worse at managing life and death of large groups (or ladders) on larger boards. I’ve observed exactly the same as @ckersch88 - it just takes the losses without flinching and tries to find compensation elsewhere. It is even crazier with amybot-beginner that just loses stones and groups left and right while still managing to come out ahead in the end in a surprising number of games.
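A tuning check of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration and it is not amybot's actual script: `target_win_rate` reads "+/- 5" as five percentage points per rank of difference (capped at three ranks), the sign convention for `rank_diff` is invented, and the sample games are hypothetical.

```python
def target_win_rate(rank_diff):
    """Assumed target for the bot: 50% in even games, shifted 5 points per rank.

    rank_diff > 0 means the opponent is weaker than the bot's intended rank.
    """
    capped = max(-3, min(3, rank_diff))
    return 0.5 + 0.05 * capped


def win_rate_delta(games):
    """Mean of (actual result - target) over (rank_diff, bot_won) pairs.

    A positive value suggests the bot is playing too strongly.
    """
    deltas = [(1.0 if won else 0.0) - target_win_rate(diff) for diff, won in games]
    return sum(deltas) / len(deltas)


# Hypothetical recent games: (opponent rank difference, did the bot win?)
recent = [(0, True), (0, True), (1, True), (-1, False), (2, True), (0, True)]
delta = win_rate_delta(recent)
print("weaken the bot" if delta > 0 else "strengthen the bot", round(delta, 2))
```

A real script would also need to decide how much to change the strength setting per unit of delta, and how to weight opponents with only a single game; neither is specified in this thread.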

I think one more source of uncertainty in the system is that amybot-ddk (and amybot-beginner even more) have lots of new players play them for the provisional games because they are the lowest-ranking bots on OGS right now. Combined with the erratic performance (missing obvious-to-humans life and death) they may win or lose uncharacteristic games and then be affected by the rank correction that the opponents will experience when also starting playing other humans.

flovo · February 12, 2019, 12:54pm

The current ratings of amybot-ddk

	overall	blitz	live	correspondence
overall	10.0k ± 0.6	10.6k ± 0.6	10.2k ± 0.6	10.7k ± 0.7
9x9	9.6k ± 0.6	10.2k ± 0.6	9.4k ± 0.6	9.2k ± 0.9
13x13	12.3k ± 0.7	12.5k ± 1.2	12.7k ± 0.7	9.7k ± 1.0
19x19	13.3k ± 0.7	15.7k ± 1.1	13.8k ± 0.7	13.6k ± 0.9

rating over time. @amibot_ddk is getting stronger

and amybot-beginner

	overall	blitz	live	correspondence
overall	27.5k ± 1.5	27.0k ± 1.2	27.1k ± 1.2	22.9k ± 1.1
9x9	27.8k ± 1.3	27.1k ± 1.3	27.9k ± 1.3	22.9k ± 1.1
13x13	25.2k ± 1.1	24.4k ± 1.0	22.6k ± 0.9	23.9k ± 1.0
19x19	21.2k ± 1.1	19.7k ± 0.9	22.7k ± 1.0	18.7k ± 0.9

The rating history of the weakest available bot is always a mess

Interesting. So amybot_ddk should be 13k instead of 10k. Maybe the Glicko rating gets too high if the rank difference is too high. I’ll try.

Edit: If I calculate ratings for only the games against 15k±1k, overall rank goes down to 12k.

siimphh · February 12, 2019, 1:20pm

I have like 50% confidence in my tuning script being entirely correct though, and I’ve only made it run automatically every day as of a few days ago. Before that I just occasionally did it manually.

The last few rounds it has decided to make amybot-ddk weaker in general. Looks like it thinks it’s ~OK for 19x19 at this point but still needs to reduce strength for 9x9 and 13x13:

9x9:

2019-02-12 00:01:57,759 2445 INFO ogs-get.report():376: Evaluated 53 players, 554 games, 16 with single game
2019-02-12 00:01:57,759 2445 INFO ogs-get.report():378: Overall win rate: 0.78
2019-02-12 00:01:57,760 2445 INFO ogs-get.report():381: Strength profile: too_weak=1, just_right=3, too_strong=18
2019-02-12 00:01:57,760 2445 INFO ogs-get.report():389: Win rate delta: mean=0.15 p5=-0.51, median=0.20, p95=0.55
2019-02-12 00:01:57,760 2445 INFO ogs-get.report():393: Weighed win rate delta: 0.21

13x13

2019-02-12 00:02:20,139 2621 INFO ogs-get.report():376: Evaluated 12 players, 31 games, 3 with single game
2019-02-12 00:02:20,139 2621 INFO ogs-get.report():378: Overall win rate: 0.61
2019-02-12 00:02:20,140 2621 INFO ogs-get.report():381: Strength profile: too_weak=1, just_right=1, too_strong=3
2019-02-12 00:02:20,141 2621 INFO ogs-get.report():389: Win rate delta: mean=0.11 p5=-0.56, median=0.25, p95=0.42
2019-02-12 00:02:20,141 2621 INFO ogs-get.report():393: Weighed win rate delta: 0.14

19x19

2019-02-12 00:02:55,595 2847 INFO ogs-get.report():376: Evaluated 49 players, 102 games, 30 with single game
2019-02-12 00:02:55,595 2847 INFO ogs-get.report():378: Overall win rate: 0.59
2019-02-12 00:02:55,597 2847 INFO ogs-get.report():381: Strength profile: too_weak=3, just_right=2, too_strong=6
2019-02-12 00:02:55,597 2847 INFO ogs-get.report():389: Win rate delta: mean=0.04 p5=-0.48, median=0.10, p95=0.70
2019-02-12 00:02:55,597 2847 INFO ogs-get.report():393: Weighed win rate delta: 0.07

DVbS78rkR7NVe · February 12, 2019, 1:48pm

Easy way to beat a 3k bot too:

smurph · February 12, 2019, 1:55pm

I’d recommend only using a single board size per bot…

andysif · February 13, 2019, 1:37am

Since we are on the topic, maybe someone can explain this. I just don’t understand why a bot would sometimes miss a ladder.

Even a simple life and death problem in the corner can have up to 20 branches if they set the depth to 3 or 4 steps, and the bot can figure it out. But a ladder has absolutely only ONE path. Why then would a bot miss it? Is it because they set the depth to 10 or something and the bot stops there?