How to Adapt WHR to Handicap Games

How is WHR Adapted to Handicap Games?

I originally posted this on the Arimaa Forum, but the concept applies equally to adapting WHR to handicap games (or reverse-komi games) in Go. Feel free to reply on either platform; I will repost any content-heavy posts (with credit, of course) to the Arimaa Forum thread.

How to Even Start?

Arimaa has 864 possible handicaps, including the even game. Does each handicap get a rating? Probably. How is it calculated? With WHR, as for the player ratings? Or would Bayesian Elo be better, since one does not expect the value of a given handicap at a given rating level to ever change, unlike player ratings? (A handicap will not be worth the same number of WHR rating points across different skill levels.)

Presumably each handicap would be associated with some function(rating) = handicapValue, where “rating” is the average of the two players’ ratings, and “handicapValue” is the number of WHR rating points that would be added to the rating of the player receiving the handicap in order for the WHR system to correctly predict both players’ win chances without knowledge of the handicap. Thus, when any two players play, the system would take their average (or median) rating, plug it into the associated function of each handicap (optimizations are definitely possible), and suggest the handicap for which the weaker player’s rating plus the result of that handicap’s associated function is as close as possible to the rating of the higher-ranked player.
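If that picture is right, the suggestion step might look something like the following sketch. Everything here is invented for illustration: the handicap names, the linear value functions, and the coefficients.

```python
# Hypothetical sketch of the suggestion step: pick the handicap whose
# value, at the players' average rating, best closes the rating gap.
# The handicap names and value functions below are invented examples.

def suggest_handicap(rating_strong, rating_weak, handicap_value_fns):
    avg = (rating_strong + rating_weak) / 2.0
    gap = rating_strong - rating_weak
    # Minimize |gap - handicapValue(avg)| over all handicaps.
    return min(handicap_value_fns.items(),
               key=lambda kv: abs(gap - kv[1](avg)))[0]

# Made-up value functions for two handicaps:
fns = {
    "one_rabbit": lambda r: 50 + 0.05 * r,
    "one_horse":  lambda r: 150 + 0.10 * r,
}
print(suggest_handicap(1700, 1500, fns))  # "one_rabbit" closes a 200-point gap best
```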

Processing Games

Player Ratings

Once the game is played (and it might have been played at any handicap at all, even one wildly inappropriate, like me giving browni3141 (the strongest Arimaa player today) 7 rabbits), two systems must appropriately handle the result. The normal WHR would need to take into account that the expected result has been influenced by the handicap used, and that handicap may not be at all what the system suggested: the system cannot rely on all games being roughly even.

I am still struggling to grok or even understand WHR (and, more generally, Bayes’ Theorem), but my intuition says that this particular adjustment would be fairly trivial for someone who grokked vanilla WHR.

Handicap Ratings

The handicaps must also be adjusted, though. I feel that my trying to think of how to do this is like someone who has no idea how baseball works, except that it uses a baseball bat, deciding to run around clocking people over the head with it, because maybe that’s how baseball works? Is there a parallel WHR system? Is it all the same system? Are the rating points for the handicaps in a 1:1 correspondence with the rating-point differences they cover? If so, how is the differing worth of the same handicap across skill levels handled? If not, how are the handicap ratings converted to and from suggested handicaps? How does one prevent degenerate behavior with two systems influencing each other? It seems like a recipe for feedback loops. Is WHR used, or Bayesian Elo, or something else entirely?

Handicap Relative Size

Presumably a solid system such as WHR or Bayesian Elo would evaluate the relative strength of different handicaps fairly quickly and accurately. But with 864 handicaps to rate, most of them absurd, and no domain-specific knowledge of their relation to each other before WHR starts to work its magic one game at a time, I fear it would take a lot of seemingly random suggested handicaps before the system learned that 7 rabbits, 2 horses, and an elephant ([7, 0, 0, 2, 0, 1]) is not much more appropriate a handicap for most matchups than 6 rabbits, 2 horses, and a camel ([6, 0, 0, 2, 1, 0]) would be. But some handicaps, such as the two above, are just straight up better than others. I do not know how, or even whether, such knowledge could be used to bootstrap the system (maybe by setting the initial ratings roughly, or by seeding the system with some sample games demonstrating the relative strengths of handicaps for the initial prior?). But the following metric seems like it would go a long way towards reducing the number of games that would need to be played at crazy handicaps before the vast majority of them gained enough rating to virtually never be suggested, and to be decently accurate when they were used anyway:

let h1, h2 be handicaps denoted by arrays a1[6], a2[6] respectively.

let subtraction of two arrays be defined as returning an array whose length equals the longer of the two input arrays (the shorter array being imagined to be padded with 0s), with out[n] = arrayIn1[n] - arrayIn2[n] for each n in out[]. I don’t know if that makes sense as subtraction, but call it something else if you like; the definition is useful here regardless of what it’s called.

let tempArray = a1 - a2

for (int x = 5; x >= 0; x--) {
  while (tempArray[x] > 0) {
    // spend one surplus piece at level x ...
    tempArray[x] = tempArray[x] - 1;
    for (int y = x - 1; y >= 0; y--) {
      if (tempArray[y] < 0) {
        // ... to cancel one deficit of a strictly weaker piece type
        tempArray[y] = tempArray[y] + 1;
        break; // exit only the inner loop; keep spending surplus at x
      }
    } // end inner for
  } // end while
  if (tempArray[x] < 0) {
    break; // a deficit no remaining surplus can cover:
           // h1 cannot be said to be strictly better than h2 by this metric
  }
} // end outer for

There’s definitely a cleaner way to do this (syntax highlighting and debugging might help, lol) but the idea is that once it exits, if tempArray has no negative values, then h1 can be said to be strictly better than h2 by this metric. It of course doesn’t work if h1 = h2, but why would you test that?
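For what it’s worth, here is the same metric as runnable Python: just a sketch of the dominance check described above, not any existing implementation.

```python
def strictly_better(h1, h2):
    """True if h1 can be said to be strictly better than h2 by the
    metric above: every deficit of a weaker piece type is covered by
    a surplus of a stronger one. Index 0 is the weakest piece type."""
    temp = [a - b for a, b in zip(h1, h2)]
    for x in range(len(temp) - 1, -1, -1):
        while temp[x] > 0:              # spend surplus at level x ...
            temp[x] -= 1
            for y in range(x - 1, -1, -1):
                if temp[y] < 0:         # ... to cancel a weaker deficit
                    temp[y] += 1
                    break
        if temp[x] < 0:                 # a deficit nothing can cover
            return False
    return True

# 7 rabbits + 2 horses + elephant vs 6 rabbits + 2 horses + camel:
print(strictly_better([7, 0, 0, 2, 0, 1], [6, 0, 0, 2, 1, 0]))  # True
# ...but not the other way around:
print(strictly_better([6, 0, 0, 2, 1, 0], [7, 0, 0, 2, 0, 1]))  # False
```

Note that giving away a rabbit is never judged better than giving away an elephant, since surplus can only cover deficits of strictly weaker piece types.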

If applied to every possible pair of handicaps, this will create a lot of directed paths between handicaps where the arrow is pointing towards a handicap strictly smaller than the handicap the arrow originates from. Since all this is transitive, redundant arrows can be skipped. Perhaps this can somehow expedite the process of rating the handicaps?
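As a sketch of that graph idea, assuming the dominance metric above and that it is indeed transitive, building the arrows and skipping the redundant ones might look like:

```python
from itertools import product

def strictly_better(h1, h2):
    # The dominance metric from above: surplus of a stronger piece
    # type may cover a deficit of a weaker one (index 0 = weakest).
    temp = [a - b for a, b in zip(h1, h2)]
    for x in range(len(temp) - 1, -1, -1):
        while temp[x] > 0:
            temp[x] -= 1
            for y in range(x - 1, -1, -1):
                if temp[y] < 0:
                    temp[y] += 1
                    break
        if temp[x] < 0:
            return False
    return True

def dominance_arrows(handicaps):
    """All (bigger -> smaller) arrows, with transitively redundant
    arrows skipped: a -> c is dropped whenever some b gives a -> b -> c."""
    edges = {(a, b) for a, b in product(handicaps, repeat=2)
             if a != b and strictly_better(a, b)}
    return {(a, c) for (a, c) in edges
            if not any(b not in (a, c)
                       and (a, b) in edges and (b, c) in edges
                       for b in handicaps)}

# Three nested handicaps (tuples, so they can sit in sets):
hs = [(2, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0)]
print(sorted(dominance_arrows(hs)))  # only the two adjacent arrows survive
```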


Does anyone have any idea how to do this? I’m sure someone must have implemented it, maybe for Go handicap stones or something, but I can’t find anything by googling. Any links to good explanations of WHR would also be awesome if you have them on hand; maybe something will click with me. How did everyone else begin to grok WHR? Would going line by line through one of the implementations of it (it seems there’s a Ruby one and a Python adaptation of the same one on GitHub) be helpful, or would I just be wasting my time if I don’t understand the math from Coulom’s paper first?


Could you please introduce me to WHR? What is Arimaa?


WHR is “Whole History Rating”, and is used in a number of places. It was created by Rémi Coulom and is generally more accurate than other systems (such as Glicko-2, which OGS uses). It is based on Bayes’ theorem, but beyond that I have grokked very little.

Arimaa is an abstract strategy game which looks analogous to chess at first glance, but has a lot of similarities to Go in its high state space, branching factor, elegant rules, and deference of tactics to strategy. But the question could easily be reworded to be about integrating reverse komi and handicap stones into WHR. In actual fact, that is what I am most interested in currently: its application to Go.


Interesting. In what ways is it more accurate?

It predicts the winner of a game with greater accuracy than competing systems. Its only downside is that it is unintuitive: if OGS used it, we would long for the days when we only had 2 or 3 threads about our ranking system per week. :smiley:


Arimaa has 864 possible handicaps including the even game


This Wikipedia sub-article on handicaps in chess is quite interesting:

Some of the more inventive handicaps listed are:

  • The weaker player’s king may move up to two squares in any direction in a straight line.

  • Checkmate on a particular square, or with a pawn

  • Giving all the pieces (i.e. keeping only the king and pawns) in exchange for playing two moves each time

  • Giving the king or queen an extra ability to move like the knight

  • The weaker player begins already castled


As long as they can be listed, they should be able to be taken into account by an extension of WHR. I simply considered sacrificing some number of pieces on the first few moves since that is the common form of handicapping in Arimaa.

You can label your breaks, depending on the programming language.

It’s not any particular language, unfortunately. Honestly, I should rewrite that whole block. Even I can tell it’s ugly.

For a long post…

  • If you have anything parallel going on, it’s easier to read bulleted/numbered lists or sub-sections.
  • When there’s dependency (“if this idea is correct, then this question arises …”), sub-sub-sections or collapsable details (and maybe some other ways?) might be helpful.

For code…

  • It might be easier to understand if motivation/outline (“once it exits …”, “If applied …”) came before code itself.
  • You could probably use more natural language.
    for (int x = 5; x >= 0; x--) →
    for each integer from 5 down to 0, or
    for each item in <array-name> in reverse order
  • Split long comments into multiple lines so sideways scrolling is unnecessary.

Backgrounds for Whole-History Rating aka WHR (Coulom):

  • Bradley-Terry: Intuitive Explanation (McCaffrey)
  • dynamic-/time-varying: assuming players learn or get tired etc, and their strengths change over time
  • “maximum a posteriori”
  • decayed-history: older games’ weights decay over time–“progressively forgotten”
    • the first 2 pages of the Coulom paper (pdf) summarize various rating-system categories, including decayed-history

Probably due to lack of math knowledge, I still don’t understand how WHR works though. I might be able to understand if there were a worked example with a small number of players.

By “better” do you mean the advantage is not too great? I’m barely familiar with the rules and have never played Arimaa.

Can we directly set non-uniform prior (so that graph looks reasonable for Arimaa experts, for example) for speeding up the process? See Automatic data reweighting (Cook).

What’s the meaning of the number “6” before the code? Is it 6 people or 6 handicap varieties or something else?

Difference in rating -(?)-> handicap. For example, 1700 vs 1500 and 1500 vs 1300: should these pairs use the same handicap?


Good points, thank you; they seem obvious once you say them, but I can see that the OP could be made quite a bit more readable by wider application of them.

Thank you for the references; I will check them out in a few minutes when I’m on my computer rather than my phone.

I mean better in the sense that a 7 stone handicap is just straight up better than a 6 stone handicap, and you could tell this even if you had never played either.

Thanks again for the reference; I’ll check it out in a few minutes.

I’m not 100% sure where you mean, but if you’re talking about when I’m defining the array lengths, that’s just because Arimaa has six piece types, so I gave the arrays 6 indices so handicaps could be represented just by listing the number of pieces the stronger side sacrificed. In Go you probably just need arrays of length 3: one for the number of handicap stones, another for the value of komi, and a third which is 0 if the stronger player takes White, and 1 otherwise.

I don’t think so. We notice the same thing in Go, that as players get stronger, handicap stones start to become worth more. That’s why the relationship between ranks and ratings on OGS is not a simple linear relationship.


For a moment let’s ignore implementation details and optimization tricks and try to understand conceptually.

Let’s say we have 4 players A, B, C and D. Suppose their true–but hidden–relative strengths are:


To estimate them, we let the players play a bunch of even games. Player A probably loses all games while D probably wins all games. Between B and C, it may be like 4 : 6.

With Bayesian we might get something like:


If we add more games I imagine the pair of hills would get sharper and sharper–like this Dirac-delta animation. I wonder what would happen to the other two.

If we want single-number estimates of ratings, we read off the tops of the hills. For A and D, we’d get the endpoints of the whole rating interval (?).
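For what it’s worth, the B-vs-C hill can be sketched numerically, assuming an Elo-style win probability and a flat prior on a grid. This is only an illustration of the picture above, not WHR itself.

```python
# Toy posterior over the rating difference d = rating(B) - rating(C),
# given B won 4 of their 10 games. Uses the Elo-style win probability
# p(d) = 1 / (1 + 10^(-d/400)) and a flat prior on a grid of integer
# rating differences. Only a sketch of the "hill", not WHR.

def likelihood(d, wins=4, losses=6):
    p = 1.0 / (1.0 + 10 ** (-d / 400.0))
    return p ** wins * (1 - p) ** losses

best = max(range(-400, 401), key=likelihood)
print(best)  # the hill peaks around d = -70: B rated ~70 points below C
```

With more games at the same 4 : 6 ratio, the peak stays near -70 but the hill narrows, which is the sharpening described above.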

Now, we eventually want to take into account:

  • handicaps
  • players’ strengths changing over time

But before we go further, let’s check if we got the right picture up to this point—I rub my lamp, summoning @ɱekriff . . .


(Blind leads the blind episode 2)

Let’s try to figure out what happens if we add handicaps. For this let’s consider Go handicaps specifically because I don’t have intuition for Arimaa.

Suppose we have 4 handicap varieties—2 stones, 3 stones, 6 stones and 9 stones.

Here let me quickly define some terms:

  • midpoint of 1300 and 1700 is 1500
  • half-width of them is 200

Then with Bayesian I think—for each handicap variety—we get a surface like:


I imagine that the surface—instead of having multiple isolated hills—has a single ridge. If we imagine that there’s a spine through that ridge and we take an x-ray photo from above, we might get something like:


If we combine “x-ray photos” of our 4 handicap-variety surfaces, we might get something like:


Anyways, if we have all 4 surfaces and want to suggest an appropriate handicap for a pair of players, we might compare the heights of the surfaces at that point and pick the highest one (?).

At this point couple of @Samraku 's questions become relevant (I don’t know the answer):


I don’t know how WHR works in detail. But I imagine it to be a black box with output that is very much like that of an Elo rating system.

I think the black box does more iterations than a normal Elo rating system (perhaps somewhat similar to Glicko2?) and I think it also uses back propagation (updating past ratings), unlike Elo rating systems. But whatever the details of those added calculations, I assume that WHR rating differences still correspond to winning probabilities, very similar to Elo rating differences.

I can give empirical formulas to convert handicaps/rank differences to winning probabilities (or equivalently, Elo rating differences, and perhaps WHR rating differences if my above assumption is true). It comes from the historical data of the European Go Database (a database of 1 million live tournament games played in Europe since 1996). See Rating system issues

Note that my formulas will not match OGS handicap/rank-rating conversion formulas.


This looks right to me.

For those wanting a general understanding:

At a high level, as I understand it, WHR is a “bayesian maximum likelihood”-based rating system (I think strictly speaking, you’d call it maximum a-posteriori). Essentially, it defines a particular model where every player has a “rating” that obeys certain properties, and it specifies some functional forms for the following:

  1. Given two players, if they play a game, what is the probability of a win or loss as a function of their ratings? (I believe WHR uses a Bradley-Terry model here, which is the basis of many Elo-like rating systems.)
  2. Given a player, what is the probability distribution over ways that their rating may evolve over time? (I believe WHR specifies a model where ratings drift over time as a Brownian motion, if no further evidence about them is obtained.)
  3. Additionally, what prior beliefs or evidence do you have about players’ ratings, or difference between certain players’ ratings?
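As a rough illustration of ingredients 1 and 2 (a sketch on the natural-log Bradley-Terry scale, not WHR’s actual code):

```python
import math
import random

def win_prob(r_a, r_b):
    # Ingredient 1: Bradley-Terry probability that A beats B,
    # with ratings on the natural-log scale rather than the Elo scale.
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def drifted_rating(rating, days, variance_per_day=0.001):
    # Ingredient 2: between observations, a rating is modeled as
    # drifting like Brownian motion; the mean stays put while the
    # uncertainty grows with elapsed time.
    return random.gauss(rating, math.sqrt(variance_per_day * days))

print(round(win_prob(1.0, 0.0), 3))  # ≈ 0.731: one natural unit ahead
```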

And you also may have to tune parameters for these things. The second part in particular is very important to tune right: it’s the part that ensures that even if you play a million games today and establish absolute statistical confidence about your exact rating, your rating can still move tomorrow based on new games (new evidence), because the system models your rating as potentially having drifted.

With those pieces in place, WHR simply throws Bayes’ theorem at the problem: find the global assignment of ratings to all players over all possible points in history such that, under the above model, it is the set of ratings most likely given every game result you’ve ever observed. This becomes a massive numerical optimization problem, but it turns out to be computationally tractable if you make good choices of functional form and write a good implementation.
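As a toy illustration of that optimization, here is a sketch that fixes all other ratings and finds one player’s maximum-likelihood rating by gradient ascent. WHR itself optimizes jointly over all players and time points (Coulom uses Newton’s method); the game data below is invented.

```python
import math

# Toy version of the optimization: fix everyone else's rating and find
# one player's maximum-likelihood rating by gradient ascent on the
# Bradley-Terry log-likelihood. Ratings are on the natural-log scale.

games = [(0.0, 1), (0.5, 0), (-0.5, 1)]  # (opponent rating, 1 if we won)

def gradient(r):
    # d/dr of sum of log P(result), with P(win) = 1/(1 + exp(opp - r)):
    # each game contributes (actual score - expected score).
    return sum(won - 1.0 / (1.0 + math.exp(opp - r))
               for opp, won in games)

r = 0.0
for _ in range(1000):
    r += 0.1 * gradient(r)  # climb until the gradient vanishes

print(round(r, 2))  # ≈ 0.72
```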

Unlike Elo systems, where players “trade points with each other” according to certain formulas after each game, Bayes-style systems are a bit more opaque. It is impossible to calculate by hand what effect a new datapoint will have on a rating, and your rating will move around based on all the data the system takes in, even when you aren’t playing games yourself.

BUT - bayes-based systems often automatically give you a lot of nice properties for free, and actually are often fairly intuitive at a high level if you approach it from the perspective of “game results are evidence about strengths of players” and ask yourself: how would this evidence affect a system’s beliefs about those strengths, if it took everything into account?

For example:

  • Winning a game can never reduce a rating. Why? A higher rating only ever increases your modeled winning chance in any game. Therefore, the evidence of you winning a game can only ever make higher ratings more likely relative to lower ratings.
  • If you played an equal-rated opponent yesterday and lost and your rating went down a bit, but that opponent unexpectedly won against even stronger people today, all else equal your rating will partially or mostly go back up. Why? Because the opponent’s games today are new evidence that they were stronger than they first appeared, so your loss yesterday is no longer as strong evidence of you being weaker.
  • If two equal-rated players play, but one hasn’t played in a long time, while the other plays regularly, the one who hasn’t played in a long time will probably move in rating more after the game. Why? Because one new game result is relatively a much bigger proportion of the evidence about a player who has very few recent games than it is for someone who plays regularly.

And so on.

Bradley-Terry models, which WHR builds from, can easily be extended to have more than one variable determining a player’s rating in a game. This is useful for team games (i.e. the team is a “player” whose rating depends on the multiple variables of its individual actual players), but it can also be useful for things like handicap. In theory you can model the handicap as an unknown variable that additively affects a player’s rating (or even affects a player’s rating variably as a function of their baseline rating), throw Bayes max-likelihood at the problem, and have it also estimate on its own how much each handicap matters, or how much it matters as a function of the player’s baseline rating, and it will come up with the optimal thing given all the data, within the limits of your functional form.

Then, the only difficulties are whether you can find a functional form that is realistic and works well enough in practice, and how well you can implement the optimization. So yes, this is doable with work: you “only” have to specify how this additional variable “handicap” may affect 1, 2, and 3 above and make sure the optimizer still runs fast enough to be tractable. Then your optimizer should do as it always does: find the most likely joint assignment of ratings and handicap impact coefficients automatically for you, given all the data and any prior beliefs you give it.
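A minimal sketch of that handicap extension, assuming (as one possible functional form) an additive handicap value that is linear in the players’ average rating; the coefficients here are placeholders for values the optimizer would actually estimate from game data:

```python
import math

def win_prob_with_handicap(r_weak, r_strong, a, b):
    # Handicap value modeled as a + b * (average rating); a and b are
    # hypothetical coefficients the optimizer would fit, not real values.
    handicap_value = a + b * (r_weak + r_strong) / 2.0
    # Bradley-Terry on the natural-log scale, with the offset applied
    # to the weaker player's rating:
    return 1.0 / (1.0 + math.exp(r_strong - (r_weak + handicap_value)))

# A handicap worth exactly the rating gap makes the game even:
print(win_prob_with_handicap(0.0, 1.0, a=1.0, b=0.0))  # 0.5
```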


Something to look forward to:


I have – towards the end of my go hiatus – taken note of this thread, idk if you want me to do a real revival of this topic, as I’d imagine getting a proper “model” for how stones influence the game (beyond what OGS already does for rank-rating conversion) would be somewhat involved, as the question of exactly how skill interacts with the power of handicap is a somewhat interesting question.

But as for the post where I was mentioned: I’m not sure what you mean by “rating interval”, as Elo and Glicko (and WHR, which gives outputs on the same scale) are generally unbounded. A and D would progressively get further and further out the more games were played, but generally the peaks would get sharper for all of them (although A’s and D’s would probably be wider than B’s and C’s) under the normal prior distribution that Glickman uses for Glicko (and I believe WHR also uses a normal distribution prior before the virtual games).

However, I’d argue that applying handicap to WHR is exactly the same as applying it to any other rating system using the expected-score formula: it would simply involve modifying that formula (and possibly some formulas derived from it).
