Current autoscore failures (2025)

This is a kind of follow-up to “A compendium of OGS's terrible scoring system confusing beginners”, which IMHO is a bit outdated now with the release of “Auto-score improvements” and “Stone removal and scoring updates” in 2024. I don’t have any insider info about how the new autoscore works, but it seems to be a lot better than it used to be, and some problems that popped up after the change seem to have been fixed.

Overall autoscore is really good! I personally think it’s better than human scoring. But there are still some ongoing failures and a perception that it’s unreliable, so I thought it would be good to track the failures in a new thread. I’m counting things like

  • surprise invasions
  • clearly dead groups being marked alive
  • not following the rules of go

This doesn’t count cases where the autoscore just didn’t run or where the players changed the markings themselves.

Please share your own examples in a reply and I’ll check them out and get them added to the top post. The “autoscore tester” links give you an easy (but unofficial) way to inspect the bug without having to fork a game.

Known autoscore failures:

A single black stone that’s clearly dead, getting marked alive since it’s next to a weak wall:
image
game link | autoscore tester

The group on the right is in atari and certainly dead, right? But it gets randomly marked alive about half the time:
image
game link | autoscore tester

The upper-left seki needs one more white move; as it stands the white stones are killable by Black. Since nobody did anything about it, we should assume both players thought the white stones were simply dead:


game link | autoscore tester

A random black stone in the middle of what seems like clear white territory but has a hidden weakness:
image
game link | autoscore tester

I may disagree. Well, it’s not so clear at least. If Black extends (nobi), what next?

I think what matters is that White can play first to kill it. If Black thought it was alive he would have continued playing.

I suppose the lone-stone example is given to illustrate the Autoscore Bug (AB), which I have been flogging for 5 years now. The AB does still occur, but less commonly than before, I think. I don’t believe it has anything to do with the autoscore not running because it has occurred with a scored board in which stray dead stones (or small groups) on both sides were left unmarked. As you know, it may have something to do with playing apps, as the bug came to light roughly around the time that those apps were becoming more popular.

Thanks for starting this new repository for information. I will certainly post any new AB examples I run across.

No, I’m trying not to make this about the autoscore bug that you’ve been flagging - as far as I can tell, “AB” happens when the autoscore fails to run at all and leaves everything unmarked. I’d like to keep this thread about cases where the autoscore algorithm runs and produces a reproducible bad result. My “autoscore tester” links let you see for yourself the exact output of the OGS autoscore API without having to fork a game.

I can easily add a dead white stone in the upper-left to demonstrate that it’s not just leaving everything unmarked:


(the square means that it was marked dead).

Sorry, but that is just not true. The anomalous quality of the AB was precisely because the autoscore ran and some, but not all, territory was left unmarked because of a stray dead stone or small group. I’ve posted many examples in the thread that this supersedes and in the CM thread where I discussed the subject at length.

Here is an example of what appears to be an anomalous malfunction of the scoring sequence: Play Go at online-go.com! | OGS. I am citing the report instead of the game, because it involves score cheating, and I want to avoid public shaming. The game has consecutive acceptances of the score, one by each player, at 2:08:49 and 2:08:50, and yet the game did not end.

To be clear, I meant that the dead stones were not marked - that’s the primary function of autoscore. All groups were considered alive, and territories were then counted accordingly. If you find cases where autoscore actually runs (we can verify that in various ways, like with my autoscore tester) and still results in problematic scoring I’m happy to add them here!
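
As a rough illustration of why unmarked dead stones matter so much, here is a minimal Python sketch of territory counting. It is not OGS's implementation (which nobody in this thread has seen); the board encoding and the function name `count_territory` are assumptions made up for this example, and it counts empty points only, ignoring stones on the board, prisoners, and komi.

```python
from collections import deque

def count_territory(board, dead=frozenset()):
    """Count empty points surrounded by a single colour.

    board: list of equal-length strings using 'B', 'W' and '.'
    dead:  set of (row, col) stones marked dead (hypothetical input format)
    """
    rows, cols = len(board), len(board[0])
    # Stones marked dead are treated as empty points before counting.
    grid = [['.' if (r, c) in dead else board[r][c] for c in range(cols)]
            for r in range(rows)]
    seen = set()
    territory = {'B': 0, 'W': 0}
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != '.' or (r, c) in seen:
                continue
            # Flood-fill one empty region, noting which colours border it.
            size, borders = 0, set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols:
                        if grid[ny][nx] == '.':
                            if (ny, nx) not in seen:
                                seen.add((ny, nx))
                                queue.append((ny, nx))
                        else:
                            borders.add(grid[ny][nx])
            # A region is territory only if exactly one colour touches it.
            if len(borders) == 1:
                territory[borders.pop()] += size
    return territory

# Toy 5x5 position: a lone black stone at (2, 0) behind White's wall.
board = [
    ".W.B.",
    ".W.B.",
    "BW.B.",
    ".W.B.",
    ".W.B.",
]
print(count_territory(board))                 # stone treated as alive
print(count_territory(board, dead={(2, 0)}))  # stone marked dead
```

In this toy position, leaving the stray black stone unmarked makes the whole left column neutral, while marking it dead turns all five points into White's territory. That is the sense in which everything downstream depends on the dead-stone marking step.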

I have an interesting theory. So the other day, I found out that the spectator estimators and the in-game estimators are different! Surprisingly, the spectators always have the better ones. Maybe OGS is trying to encourage players to try calculating without an estimator? 🤔

Yes, the score estimator during the game is deliberately “hamstrung” to be imperfect, to prevent people from using it as a crutch. This change was made about 3 or 4 years ago.

I don’t understand: should White add a move or not? I would rather call the situation unsettled until White adds a move, instead of calling the stone clearly dead.

I don’t see a need for White to actually add a move. It’s obvious that White could kill it by playing first, and Black gave up the option to try defending by passing. So both players have clearly indicated that they believe that stone is dead.

They might be wrong to be so confident about this conclusion, and of course there will be edge cases, but in this case isn’t it clear that autoscore is revealing something the players didn’t see?

Ah yes, OK: if both players agreed about what is dead then it should be clear for the autoscore too. But then the players should mark this stone as dead; it could be hard for the autoscore to understand the game in the same way as the players.

(In the second example it’s much clearer; that’s a different case.)

Yes, we might eventually come up with some cases that are understandably hard, and in those cases it might make sense to default to alive and let the players work it out.

But this one seems really easy - there’s just one unsettled black stone in a huge white territory. And my proposed autoscore v3 algorithm gets it right. It’s interesting to see the cases that OGS autoscore gets “wrong” because that might reveal something about what the limitations of its algorithm are.

(By the way, the players in this specific case had no opportunity to mark stones dead because it was a bot game.)

Not to me lol. It’s not about how many stones there are.

There is no obvious answer to a nobi by Black. It’s sente on the left (or a capture will happen) and then it’s threatening the right. I’d say that between beginners it could be a 50/50 chance to win.

That’s not a position I would call an autoscore failure. In the other example it’s clear, just eat a bunch of stones so nothing will ever happen. But here it’s difficult for an autoscore to understand the position and agree with your view (just a stone in white territory).

But Black didn’t nobi. Is it not clear that they thought it was hopeless? Do you really think the players might have thought that one stone all by itself was safe?

How can the autoscore determine who is right? The players may think as you say, but how will the autoscore understand that this is white territory from the agreement between the players?

That or “White didn’t capture, maybe because white is already dead on the bottom”.

How can the algorithm guess the players’ thoughts?

Even if this is not such an example, such examples exist (after all, there are examples where this thinking is correct).

You really believe White might have thought its huge group on the bottom was killed by that single black stone?

Are you just trying to pretend you’re an algorithm or is this honestly unclear to you?

Of course it can’t always guess the players’ thoughts, and there will be edge cases. But it does pretty well in most cases and it’s worth pointing out the ones where it fails. If a serious edge case comes up I’m happy to talk about that too, but this to me looks like a failure.

I don’t know how OGS autoscore works. But as for how it could possibly work, we have the example of autoscore v3 that does work in this case, and I’ve explained the steps in the calculation very clearly at the top of the page:

https://pdg137.github.io/autoscore/v3.html?game=73244030

Yes, insofar as it’s unclear to me how an algorithm should draw a line between the two cases.

I agree that this is a case where the OGS autoscore result was probably not aligned with player intentions though.