I do wonder whether there could be some unwanted behaviour that stays hidden because the aggregate averages look nice.
It would be interesting to look at individuals. For example, take the set of players who have a reasonably established rank on two different board sizes, then count how many of them have their wins better predicted by the overall rating, and how many by the ratings for the individual board sizes.
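To make the per-player comparison concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption on my part: the record keys (`player`, `size`, `won`, `overall`, `overall_opp`, `per_size`, `per_size_opp`) are made up, and the win probability is a plain Elo-style logistic curve rather than whatever the site's rating system actually uses. Each player is scored by total log loss under both predictors, and only players with games on at least two board sizes are counted.

```python
import math
from collections import defaultdict


def win_probability(rating_a, rating_b):
    """Elo-style expected score for A (an assumed model, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def log_loss(p, won):
    """Penalty for predicting probability p when the result was `won`."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)  # avoid log(0)
    return -math.log(p if won else 1.0 - p)


def compare_predictors(games):
    """Count players better predicted by overall vs per-size ratings.

    `games` is an iterable of dicts with hypothetical keys:
    player, size, won, overall, overall_opp, per_size, per_size_opp.
    """
    losses = defaultdict(lambda: [0.0, 0.0])  # player -> [overall, per-size]
    sizes = defaultdict(set)                  # player -> board sizes played
    for g in games:
        totals = losses[g["player"]]
        totals[0] += log_loss(
            win_probability(g["overall"], g["overall_opp"]), g["won"])
        totals[1] += log_loss(
            win_probability(g["per_size"], g["per_size_opp"]), g["won"])
        sizes[g["player"]].add(g["size"])

    counts = {"overall": 0, "per_size": 0, "tie": 0}
    for player, (overall_loss, size_loss) in losses.items():
        if len(sizes[player]) < 2:
            continue  # only players with games on two or more board sizes
        if overall_loss < size_loss:
            counts["overall"] += 1
        elif size_loss < overall_loss:
            counts["per_size"] += 1
        else:
            counts["tie"] += 1
    return counts
```

Lower log loss means better prediction, so the returned counts say how many individuals each rating serves better, which is exactly what an aggregate average could hide. A proper version would also need a definition of "reasonably established", for instance a minimum number of games per board size.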
Maybe it would be interesting to see a tighter deviation limit than 120? It might also be interesting to only look at games after the new handicap & komi values started to be used for 9x9 and 13x13 boards.
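Both of those filters are easy to try on the same data. A small sketch, again with made-up record keys (`deviation`, `opp_deviation`, `size`, `played`); the cutoff date is left as a parameter since I don't have the exact date the new values went live, and I'm assuming the date filter should only drop 9x9 and 13x13 games:

```python
from datetime import date


def filter_games(games, max_deviation=120.0, not_before=None,
                 small_boards=(9, 13)):
    """Keep games where both players are under the deviation limit.

    If `not_before` (a datetime.date) is given, small-board games played
    before it are dropped as well. All record keys are hypothetical.
    """
    kept = []
    for g in games:
        if g["deviation"] > max_deviation or g["opp_deviation"] > max_deviation:
            continue  # at least one rating is too uncertain
        if (not_before is not None
                and g["size"] in small_boards
                and g["played"] < not_before):
            continue  # small-board game from before the rule change
        kept.append(g)
    return kept
```

Running the comparison with, say, `max_deviation=80` and then again with the date cutoff would show whether the nice averages survive the stricter selection.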