The integrated AI Review feature for OGS

Animiral · May 20, 2019, 2:33pm

I am convinced that there are real technical problems with the AI review as currently implemented. The evaluation is not as good as advertised. This is why some of its outputs are so baffling.

I recently played this game, in which I started out with a generally favorable opening. Unfortunately, I soon learned that my opponent was great at fighting.
At a critical point, I made a 60% mistake and lost an important group to a snapback.

Black must save his 10 stones and threaten the 7 white stones below. My move was at the circled point. White answers at the blue point.
The full review by Leela on OGS does not indicate this 60% blunder at all.

This position is judged as 79% for black, close to my own analysis with Leela. Then…

B m6 +11.1pp -> 89.6% (totally wrong!)
W o8 +26.2pp -> 63.4% (still claiming black ahead)
B o7 +4.1pp -> 67.5% (unbelievable)
W p7 +13.9pp -> 53.6% (black is clearly not ahead!)
B n12 -40.0pp -> 13.6% (close)

Only now does the evaluation reflect the death of black, seemingly blaming it on the tenuki.
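To double-check the arithmetic in the list above, here is a small sketch that recomputes each change from the mover's point of view. The win rates are copied from the review output. The starting value of 78.5% is my assumption, inferred from 89.6 - 11.1, since the display only shows a rounded 79%. The function name `mover_deltas` is made up for this example.

```python
# Black's win rate (percent) after each move, copied from the OGS review output.
MOVES = [
    ("B", "m6", 89.6),
    ("W", "o8", 63.4),
    ("B", "o7", 67.5),
    ("W", "p7", 53.6),
    ("B", "n12", 13.6),
]

# Assumption: inferred from 89.6 - 11.1; the review shows this rounded to 79%.
START_BLACK_PCT = 78.5


def mover_deltas(start_black_pct, moves):
    """Return (color, move, change in pp as seen by the player who moved)."""
    deltas = []
    previous = start_black_pct
    for color, move, black_pct in moves:
        change = black_pct - previous
        if color == "W":
            # A drop in Black's win rate is a gain for White.
            change = -change
        deltas.append((color, move, round(change, 1)))
        previous = black_pct
    return deltas


deltas = mover_deltas(START_BLACK_PCT, MOVES)
gains = [d for d in deltas if d[2] > 0]
for color, move, change in deltas:
    print(f"{color} {move} {change:+.1f}pp")
print(f"{len(gains)} of {len(deltas)} moves were scored as a gain for the mover")
```

Under that assumption the numbers are internally consistent, so the display is not simply mangling its own subtraction. What stands out is that four of the five moves are scored as improvements for the player who made them.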

I downloaded the same network, 9006c708, to reproduce these outrageous results (with Leela 0.17). What I found is that Leela quickly and correctly judges m6 to be a mistake. After this move, black is at around 20% after only a few hundred playouts, less than half of the 1600 that OGS claims to have used.

Whatever analysis is running on OGS’ servers, I suspect that the numbers get lost along the way and don’t make it into the final output. The percentages shown seem more fitting for a “first-glance” evaluation without any playouts.

The green color highlights are suspicious. Even after a few hundred playouts, my Leela has already invested most of its reading effort into moves A, B and C.

On top of that, I have to agree with some previous criticisms of the display of moves B… onwards. The fact that these moves are considered does not mean much. Even after a few thousand playouts, Leela still thinks that move F yields 50% for black. It only sees how bad the move is when I actually play it. This is because Leela is not interested in exploring the second- and third-best moves. At no point does the engine calculate how bad these alternatives really are. Only the percentage of the top move is informative.
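One way a review display could act on this is to show a percentage only for candidates that actually received a meaningful share of the playouts. The sketch below is purely illustrative: the numbers are invented, the field names and the 10% threshold are my own assumptions, and none of it is taken from OGS or Leela.

```python
def trustworthy(candidates, min_share=0.10):
    """Return the moves that received at least `min_share` of all playouts.

    `min_share` is an arbitrary threshold chosen for this example.
    """
    total = sum(c["playouts"] for c in candidates)
    if total == 0:
        return []
    return [c["move"] for c in candidates if c["playouts"] / total >= min_share]


# Hypothetical numbers for illustration only, not real engine output.
candidates = [
    {"move": "A", "playouts": 900, "winrate": 79.0},
    {"move": "B", "playouts": 80, "winrate": 70.0},
    {"move": "F", "playouts": 20, "winrate": 50.0},
]

print(trustworthy(candidates))
```

With these made-up numbers only the top move passes the filter. The percentages attached to the other two rest on so few playouts that they are closer to a first guess than to a reading result.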

If the AI review were working correctly, we would only very rarely see positive percentage changes. As it stands, Leela appears to be constantly mind-blown by all our genius moves, which is not a great impression from a supposedly strong Go player.
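The reasoning behind that expectation can be shown in a few lines. An idealized evaluator that has read every candidate fully rates the position by its best available move, so whatever is actually played can at most match that rating. The values below are hypothetical and the function name is made up. A real engine with limited playouts will sometimes be surprised, which is why I say "very rarely" and not "never".

```python
def change_for_mover(candidate_values, played):
    """Change in the mover's win rate (pp) caused by playing `played`.

    `candidate_values` maps each legal move to the mover's win rate after it,
    as judged by an idealized evaluator that has fully read every move.
    """
    before = max(candidate_values.values())  # position value assumes best play
    return candidate_values[played] - before


# Hypothetical values, not taken from any real analysis.
values = {"best": 79.0, "second": 74.5, "blunder": 20.0}
changes = {move: change_for_mover(values, move) for move in values}
print(changes)
```

Every change is zero or negative. A review that reports large gains for the mover on most moves is therefore underestimating the position before the move, not discovering brilliant play.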