Without verifying closely, I assume the 64-visit vs 8-visit thing relates to the timing of the bugfixes to how KataGo’s low-playout searches override a bad policy. The authors of the paper were great and responsive to talk to in private correspondence, which included some investigation of this bug, and I’m thankful to them for helping to discover it so that I could fix it.
Unlike the cyclic group attack, which produces positions that the net consistently misevaluates because it is entirely blind to the concept, the pass attack mostly relied on the raw policy being a bit fuzzy and non-robust and sometimes putting some mass on the pass move even though the evaluations were fine: as far as I was aware, passing would instantly be evaluated as much worse than playing a move. But if you’re doing a very low-playout search, even if the algorithm “instantly” knows the correct values, you still might choose the bad move.
Why? Well, the most plain-vanilla-possible MCTS implementation simply chooses the move to play with probability proportional to each move’s visit count raised to some power. So perhaps you do 6 visits and you put 3 into the pass move, 2 into move A, and 1 into move B. And suppose both move A and B are correctly evaluated as totally winning while passing is evaluated as totally losing. Nonetheless, passing received the most visits (3, instead of 2 or 1), so you’re still most likely going to pass.
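To make that concrete, here’s a minimal sketch of that plain-vanilla final move selection (illustration code only, not KataGo’s actual implementation; the select_move helper and power parameter are just names I’m using here):

```python
import numpy as np

# A minimal sketch (illustration only, not KataGo's actual code) of "vanilla"
# final move selection: play a move with probability proportional to its
# visit count raised to some power.
def select_move(moves, visits, power=1.0):
    weights = np.array(visits, dtype=float) ** power
    probs = weights / weights.sum()
    return np.random.choice(moves, p=probs), probs

# The 6-visit example from the text: pass is evaluated as totally losing and
# A and B as totally winning, but pass happened to receive the most visits,
# so it is still the single most likely move to be played.
moves = ["pass", "A", "B"]
visits = [3, 2, 1]
_, probs = select_move(moves, visits)
print(dict(zip(moves, probs)))  # pass: 0.5, A: ~0.33, B: ~0.17
```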
Obviously this is stupid - yes, you could do something much better (and KataGo does do something a little better, although pre-bugfix it could occasionally do worse than vanilla instead!) - but it’s still not worth adding a lot of complexity to tune exactly the right logic for very low visits. Good move selection algorithms for MCTS are often derived mathematically from statistical methods - estimating the variance of the values on moves, modeling MCTS as a multi-armed-bandit problem, etc. But of course, if you know anything about statistics, you know that these methods are generally built for the case of many samples and give meaningless results with very few samples. For example, if there were a poll to predict the outcome of a local government election, what kind of statistics would you use to analyze the poll if it sampled a grand total of 6 random people instead of thousands?
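To see how the statistics degenerate, here is the textbook UCB1 exploration bonus evaluated at tiny versus large sample counts (UCB1 is just an illustrative stand-in for this family of bandit formulas, not KataGo’s actual selection rule):

```python
import math

# Textbook UCB1 exploration bonus, as a stand-in for the bandit-style math that
# MCTS selection rules are derived from (not KataGo's actual formula):
#   bonus = sqrt(2 * ln(total_samples) / samples_for_this_move)
def ucb1_bonus(total_samples, move_samples):
    return math.sqrt(2.0 * math.log(total_samples) / move_samples)

# With 6 total samples the bonus dwarfs any realistic value difference between
# moves (values live in [0, 1]), so the "statistics" tell you almost nothing.
print(ucb1_bonus(6, 1))        # ~1.89
# With thousands of samples the bonus shrinks to where value differences dominate.
print(ucb1_bonus(6000, 1000))  # ~0.13
```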
The answer is that you wouldn’t - you would just get more samples (i.e. use more visits). If you do more visits, moves A and B will trivially overcome passing due to their winning evaluations, and then once the counts are further raised to a decent power the chance to pass becomes negligible.
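For instance, with made-up visit counts from a somewhat deeper search, where the winning moves have soaked up nearly all the visits, the same proportional rule plus a modest power already makes passing negligible:

```python
# Made-up visit counts from a deeper search: the winning moves A and B have
# soaked up almost all of the visits, leaving pass with a token single visit.
visits = {"pass": 1, "A": 60, "B": 39}
power = 2.0  # "a decent power"
weights = {m: v ** power for m, v in visits.items()}
total = sum(weights.values())
print({m: w / total for m, w in weights.items()})
# -> pass: ~0.0002, A: ~0.70, B: ~0.30
```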
Anyways, this is why trying to exactly optimize against and beat very low-visit MCTS is a slightly weird thing to do - the algorithm was never intended to be good in that regime, and the result is very sensitive to exactly how the parameters of your algorithm break down as you get to sample sizes too low for the statistics to work well. Even if in “normal” board positions it just so happens to still be at a “professional” level.
You do have to take some nuance and care in interpreting results from down-tuning a system to a given “level” (e.g. the number of visits that produces “professional”-level play) and then showing flaws at that level. Like, suppose with more training a bot became “10x stronger” (whatever that means, however you measure it) uniformly across every fixed number of visits for normal board positions, but only “2x more robust” in how it handles weird outlier positions. Then, in the big picture, this is a strict improvement in everything - both normal strength and robustness, at any fixed compute budget. But the bot may look worse if what you measure is “how robust is it at the amount of compute that makes it play at such-and-such level”, because the 10x strength gain can shrink the compute needed to reach that level by more than the robustness improved.
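A toy model with entirely made-up numbers shows the pitfall (none of these figures are real measurements):

```python
# Toy model with made-up numbers, just to illustrate the measurement pitfall.
def robustness_old(visits):
    return visits       # old bot: some arbitrary increasing function of visits

def robustness_new(visits):
    return 2 * visits   # new bot: "2x more robust" at every fixed visit count

# The new bot is much stronger per visit, so it reaches "professional level"
# with far fewer visits (hypothetical numbers).
visits_for_pro_level_old = 64
visits_for_pro_level_new = 8

print(robustness_old(visits_for_pro_level_old))  # 64
print(robustness_new(visits_for_pro_level_new))  # 16 <- robustness "at professional level" dropped
```

The new bot is strictly better at every fixed visit count, yet the measured “robustness at professional level” falls, simply because that level is now reached with far less compute.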
At least, it’s been fun to think about and talk these things over with people, the authors of this work heartily included, and I also think it’s important to figure this stuff out, especially for deploying automated systems in the real world. Even if you deploy an AI system that is “human level” by some overall metric, it might be worse than “human level” in individual areas, and especially if you correspondingly cut costs as it improves, you may even turn improvements into regressions without realizing it.