[resolved (I think)] Ongoing AI analysis issues

Hi All,

Just letting you know we’re aware of, and actively working on, the slow AI processing times and slow auto-scoring. I’m hoping to have a fix up today, but realistically it might be another day or so.

For the curious, the root of the issue is that we recently upgraded our underlying AI analysis engine to the latest and greatest KataGo 12.4. For the most part it works well, but there are some edge cases it crashes on. Since our system retries a position when a crash happens (up to a certain limit), periodically a problematic game enters the system and crashes all instances of our analysis engine, which then take many seconds to start back up before crashing again. This goes on for a while until we finally give up on all the moves in that game. In between crashes we make a little bit of progress, but while this is happening the processing backlog for games grows a lot.
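To make that a bit more concrete, here’s a very rough sketch of the kind of retry-with-a-limit loop involved. This is not our actual code and every name in it is made up, but it shows why a single bad position can eat so much time:

```python
import time


class EngineCrashed(Exception):
    """Raised when an analysis engine process dies mid-query."""


MAX_RETRIES = 3        # hypothetical retry cap; the real limit is different
RESTART_DELAY_S = 10   # engines take many seconds to come back up after a crash


def analyze_with_retries(engine, position):
    """Analyze one position, restarting the engine after each crash.

    If the position itself is what triggers the crash (the KataGo 12.4
    edge case), every retry crashes again and roughly
    MAX_RETRIES * RESTART_DELAY_S seconds are burned before we give up,
    which is why the backlog grows while a problematic game is in the queue.
    """
    for _ in range(MAX_RETRIES):
        try:
            return engine.analyze(position)   # hypothetical engine interface
        except EngineCrashed:
            time.sleep(RESTART_DELAY_S)       # wait out the engine restart
            engine.restart()
    return None  # give up on this move so the rest of the queue can proceed
```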

Anyways, I think I’m close to having at least a workaround for the crash in place. Sorry for how long it’s taken to get the problem under control.

– anoek

11 Likes

In my very humble opinion, OGS should not deploy the new Kata engine until the crashes are fixed.

Correctness first.


Ian

2 Likes

Oh believe me, I know… Everything seemed fine for a long while, which is why I proceeded with the update on all analysis servers, and unfortunately rolling back wasn’t really feasible in this case.

Anyways, the good news is that I have a mitigation in place and it seems to be holding for now, so we might be in the clear, though I’m still monitoring to be sure.

4 Likes

Did I spot a volunteer to set up test cases for the AI?

Let me know and I’ll work with you to get them in place.

2 Likes

Hmm, I’d have thought KataGo had its own test cases. Is that so? I’m sincerely curious.


Ian

2 Likes

Digging deeper, the current belief is that it’s an NVIDIA bug; see TensorRT sometimes hits "nonfinite for policy sum" error · Issue #694 · lightvector/KataGo · GitHub for the more complete thread. I probably would have switched to the CUDA version had I realized I was dealing with a known issue, but by the time I had debugged enough to figure out what I was dealing with, I already had a workaround that seems to be holding steady. So for now I’ve got the hole plugged while the powers that be at NVIDIA sort things out.
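In case it’s useful, here’s a heavily simplified sketch of the general “blame the query, not the engine” idea behind a workaround like this, i.e. stop retrying a position once it has taken the engine down a couple of times. It is not our actual workaround; the command line, paths, and names are all placeholders, and the real analysis engine is a single long-running process rather than one process per query:

```python
import subprocess

# Illustrative only: placeholder command line, config, and model paths.
KATAGO_CMD = ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"]
MAX_CRASHES_PER_QUERY = 2
BAD_SIGNATURE = "nonfinite for policy sum"   # error string named in KataGo issue #694

blacklist = set()       # query ids we have given up on
crash_counts = {}       # query id -> how many times it has taken the engine down


def submit(query_id, query_line):
    """Run one analysis query in its own short-lived process (simplified).

    If the process dies, or reports the known TensorRT error, the query is
    blamed and eventually blacklisted instead of being retried forever.
    """
    if query_id in blacklist:
        return None

    proc = subprocess.run(
        KATAGO_CMD,
        input=query_line + "\n",
        capture_output=True,
        text=True,
    )

    crashed = proc.returncode != 0 or BAD_SIGNATURE in proc.stderr
    if crashed:
        crash_counts[query_id] = crash_counts.get(query_id, 0) + 1
        if crash_counts[query_id] >= MAX_CRASHES_PER_QUERY:
            blacklist.add(query_id)   # stop letting this position crash-loop the engine
        return None

    return proc.stdout
```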

It’s a good question, but it seems like the answer is “if they do, the test cases aren’t broad enough to catch these crashing corner cases,” eh…

… which doesn’t bode well for us being able to do so either.
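Still, to make the test-case idea a bit more concrete, here’s a rough sketch of what a crash-regression case against KataGo’s JSON analysis engine could look like. The command line, config, and moves are placeholders; a real case would replay the exact position from a game that crashed the engine:

```python
import json
import subprocess

# Placeholder command line; point these at a real analysis config and model.
KATAGO_CMD = ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"]

# Stand-in position; a real regression case would use the moves from a
# game known to have triggered the crash.
QUERY = {
    "id": "crash-regression-1",
    "moves": [["B", "Q16"], ["W", "D4"], ["B", "Q4"]],
    "rules": "japanese",
    "komi": 6.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "analyzeTurns": [3],
}


def test_known_crasher_completes():
    """The engine should answer the query and exit cleanly, not crash."""
    proc = subprocess.run(
        KATAGO_CMD,
        input=json.dumps(QUERY) + "\n",
        capture_output=True,
        text=True,
        timeout=120,
    )
    assert proc.returncode == 0, proc.stderr
    response = json.loads(proc.stdout.splitlines()[-1])
    assert "error" not in response, response
```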