We’ve referenced it under different names: “binomial deviance”, “Shannon entropy”, etc.
I like to describe it in information-theoretic terms as the “average amount of information each data point provides, given the model.” Information here is a bad thing, because it’s a measure of “surprisal”: if each data point provides less information on average, the model is doing a better job of predicting the outcomes.
Now, to explain the logarithms, I’ll give this example:
It would be reasonable to say that the information provided by a flip of a fair coin is 1 bit, that the information provided by two such flips is 2 bits, three is 3 bits, et cetera…
Now, if we think about the probabilities behind these flips of increasing information, the information grows linearly while the probabilities shrink multiplicatively. That is, the information provided by n flips is exactly 1 bit more than that of n-1 flips, while the probability of any particular result of n flips (assuming the order of flips is preserved) is exactly half that of n-1 flips.
Luckily there’s a neat little class of functions that treats multiplicative relationships as if they were additive: logarithms. So naturally we can take the probabilities (p) of the results of these multi-coin flips and run them through -log2(p) to get a function that behaves exactly the way we described above.
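Here’s a minimal sketch of that idea in Python (the name surprisal_bits is my own, not standard terminology):

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content ("surprisal") of an outcome with probability p, in bits."""
    return -math.log2(p)

# A specific ordered sequence of n fair coin flips has probability (1/2)**n,
# and its surprisal comes out to exactly n bits:
for n in range(1, 5):
    print(n, surprisal_bits(0.5 ** n))  # prints: 1 1.0, 2 2.0, 3 3.0, 4 4.0
```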
Now, of course, so far this only seems useful for flipping fair coins, so to expand the idea slightly, let’s imagine the case of two fair coin flips where order is not preserved. So 1/4 of the time it’s two heads, another 1/4 of the time it’s two tails, and 1/2 of the time it’s one head and one tail.
Naturally, in the two-heads and two-tails cases, each outcome still provides 2 bits of information when it occurs. But when there’s one head and one tail, there are two different possible underlying outcomes; one might argue it’s a “coin flip” as to which of the 2-bit outcomes actually happened. So we will interpret the one-head-one-tail outcome as recovering only 1 bit of information, since it requires one more bit of information to recover the full result of the two coin flips.
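Running the unordered two-flip distribution through the same sketch shows both the per-outcome surprisals and their probability-weighted average, which is exactly the “average amount of information” idea from earlier (that average is the Shannon entropy of the distribution):

```python
import math

def surprisal_bits(p: float) -> float:
    return -math.log2(p)

# Unordered outcomes of two fair coin flips and their probabilities:
outcomes = {"two heads": 0.25, "two tails": 0.25, "one of each": 0.5}

for name, p in outcomes.items():
    print(name, surprisal_bits(p))  # prints: two heads 2.0, two tails 2.0, one of each 1.0

# Averaging the surprisal, weighted by probability, gives the Shannon entropy:
print(sum(p * surprisal_bits(p) for p in outcomes.values()))  # 1.5 bits
```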
One would also notice that its having a probability of 1/2 and providing one bit of information is no coincidence, and the theory is that you can extend the same logic to any source of information via the formula -log2(p). This becomes much more concrete in the realm of data compression and encryption, as it corresponds to file sizes after compression, or during encryption as you consider random sources of data loss.
Now, you’ve probably seen it stated somewhere in the form -[Y*log10(E) + (1-Y)*log10(1-E)].
A difference you might notice is the use of a base-ten logarithm, but this is merely a scaling factor. If that doesn’t make sense, I would highly recommend playing around with various exponentials and logarithms, but the rule is loga(x) = logb(x)/logb(a).
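A quick numerical check of that change-of-base rule, and of the claim that switching bases only rescales things:

```python
import math

x = 0.25
print(math.log2(x))                   # -2.0
print(math.log10(x) / math.log10(2))  # -2.0, the same value
# Swapping log2 for log10 just multiplies every result by the constant factor
# log10(2) ≈ 0.301, so it rescales the loss without changing which model looks better.
```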
But you might also notice those Y’s. A Y is supposed to be 1 for a win, 0 for a loss, and usually 1/2 for a tie (in games where ties occur). So whenever you win, the second term becomes 0, and after simplifying you get -log10(E); when you lose, the first term becomes 0, so the result becomes -log10(1-E), where E is the model’s predicted probability of you winning. If you average this out over every game, you obtain the average amount of information that the results of the games provided beyond the model’s predictions.
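Putting that together, here’s a minimal sketch of the whole formula (the game results and predictions are made-up numbers, purely for illustration):

```python
import math

def average_log_loss(results, predictions):
    """Average of -[y*log10(e) + (1-y)*log10(1-e)] over all games.

    results:     1 for a win, 0 for a loss, 0.5 for a tie
    predictions: the model's predicted win probability e for each game
    """
    total = 0.0
    for y, e in zip(results, predictions):
        total += -(y * math.log10(e) + (1 - y) * math.log10(1 - e))
    return total / len(results)

results = [1, 0, 0.5]          # a win, a loss, and a tie
predictions = [0.7, 0.4, 0.5]  # the model's predicted win chances
print(average_log_loss(results, predictions))  # lower = less surprised = better model
```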
There are also a couple of added benefits, such as punishing the model infinitely for being 100% certain and wrong, but as I hear it, this is a fairly standard test of a model’s accuracy.