Imagine two people – I’ll call them Partially-Psychic Pete and Just-A-Regular-Guy John – are playing a game where they take turns attempting to make predictions about a fair coin toss by forecasting the probability that it will come up heads.

I made a related comment on your post about 538 calibration which is basically: who would the superior gambler be? Partially-Psychic Pete or Just-A-Regular-Guy John?

At fair odds, presumably Pete...

But it gets interesting if the odds aren't even, right? Say the bookie (foolishly) offers 1:99 odds on tails and 99:1 odds on heads. John always takes the big payout and wins about 50 times out of 100. That probably beats Pete, who 40% of the time incorrectly takes the tiny payout. Right?

My math isn't strong enough here but I'm curious where the line is in terms of odds unevenness before John is favoured. My intuition says it's tightly related to Pete's overconfidence... I think I need to try some simulations or write this down properly...
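Since the comment above literally asks for simulations, here's a minimal Monte Carlo sketch of the game. It assumes, as in the example, that Pete calls the flip correctly 60% of the time and always backs his hunch regardless of odds, while John always bets whichever side has the larger payout; the payout values and the function/parameter names are just illustrative choices, not anything from the original discussion.

```python
import random

def profit(bet_heads, coin_heads, payout_heads, payout_tails):
    """Net profit on a 1-unit stake; payouts are net odds (e.g. 99 for 99:1)."""
    if bet_heads:
        return payout_heads if coin_heads else -1.0
    return payout_tails if not coin_heads else -1.0

def simulate(accuracy=0.6, payout_heads=99.0, payout_tails=1/99,
             n=200_000, seed=0):
    """Average profit per bet for Pete and John over n fair coin flips."""
    rng = random.Random(seed)
    john_bets_heads = payout_heads >= payout_tails  # John chases the big payout
    pete = john = 0.0
    for _ in range(n):
        coin_heads = rng.random() < 0.5
        # Pete backs his psychic hunch, which is right `accuracy` of the time
        pete_bets_heads = coin_heads if rng.random() < accuracy else not coin_heads
        pete += profit(pete_bets_heads, coin_heads, payout_heads, payout_tails)
        john += profit(john_bets_heads, coin_heads, payout_heads, payout_tails)
    return pete / n, john / n

print(simulate())                                      # lopsided 99:1 odds
print(simulate(payout_heads=1.0, payout_tails=1.0))    # even odds
```

At the lopsided 99:1 odds John comes out well ahead; at even odds Pete does. If I've done the algebra right, with net odds b_H on heads and b_T on tails, a 60%-accurate Pete has expected profit 0.3·b_H + 0.3·b_T − 0.4 against John's 0.5·b_H − 0.5, so Pete is favoured exactly when b_H < 1.5·b_T + 0.5.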

Sorry about the late response! Right, at even odds Partially-Psychic Pete has a positive expected return on his bets, while Just-A-Regular-Guy John would only break even. But if we consider what kinds of bets they'd be willing to take: Just-A-Regular-Guy John would probably refuse unfavorable odds on tails (since he knows it's 50/50), but Partially-Psychic Pete, because he overestimates his own accuracy, could be talked into odds so unfavorable that his expected return turns negative.
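A quick back-of-envelope check on that last point. The 60% true accuracy matches the example above; the 90% figure for how accurate Pete *believes* he is is an assumption I made up for illustration.

```python
# Pete believes his calls are right 90% of the time but is really right 60%.
believed, actual = 0.9, 0.6

# He'll accept any net odds b where his *believed* expected value,
# believed * b - (1 - believed), is positive:
worst_odds = (1 - believed) / believed       # b just above 1/9, about 0.111
true_ev = actual * worst_odds - (1 - actual)  # his real EV at those odds

print(f"Pete accepts net odds as low as {worst_odds:.3f}")
print(f"His true expected return there is {true_ev:.3f} per unit staked")
```

That comes out to roughly −0.33 per unit staked: a bet Pete is happy to take and guaranteed, in expectation, to lose.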

Great post Mike. Although I'm a fan of empirical track records, it seems that there are a lot of concerns that standard scoring rules don't capture. While it might not cover exactly the topic you touched on, I think you might enjoy this paper on Alignment Problems With Current Forecasting Platforms by Nuño Sempere and Alex Lawsen (https://arxiv.org/pdf/2106.11248.pdf). You might also enjoy my recent post about problems with forecasting tournaments (https://abstraction.substack.com/p/against-forecasting-tournaments).

Thanks! Sounds cool, I'll check it out.

Why do you prefer Brier scores to log odds? Or, perhaps a better phrasing: for what sorts of purposes do you think Brier scores are better, and for what sorts of purposes are log odds better?

That's a good question and unfortunately I don't really have a good answer for it. I haven't put a lot of thought into log odds scoring and its relative advantages and disadvantages, but I'll look into it.

There are obvious theoretical justifications for log scoring, and if you have large numbers of observations with a small amount of information each it's clearly the right thing to do, but I'm not sure if there are real-world reasons why you might prefer Brier score in some situations.

What are the theoretical advantages of log scoring over Brier? I know log scoring punishes over-confidence more than Brier, but that seems like a subjective preference rather than an objective advantage.

One major disadvantage of log scoring in real-world scenarios is that if someone gives a probability forecast of exactly 1 or 0 and gets it wrong, you hit log(0) when you try to calculate the score. And if you're scoring log odds, log(p/(1−p)), a forecast of 1 gives a divide-by-zero error even when they get it right.

Now, you might say it's fair for someone to be punished with a negative-infinity score if they give a 1 or 0 forecast, but the inability to compute the score seems like a major disadvantage in real-world scenarios where people might be badly calibrated or just bad at making forecasts. With Brier scores on the other hand, if you forecast a 1 or 0 and get it wrong, you simply get the maximum possible punishment for that question, and the calculation turns out fine.
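Concretely, here's a toy sketch of the two scores (the function names are mine; the log score is written as surprisal in bits, golf rules, lower is better):

```python
import math

def log_score(p, outcome):
    """Surprisal, in bits, of the probability assigned to what actually happened."""
    prob_of_outcome = p if outcome else 1 - p
    return -math.log2(prob_of_outcome)   # blows up when prob_of_outcome == 0

def brier_score(p, outcome):
    """Squared error against the 0/1 outcome; always between 0 and 1."""
    return (p - (1.0 if outcome else 0.0)) ** 2

print(brier_score(1.0, False))     # 1.0 -- maximum penalty, but finite
try:
    print(log_score(1.0, False))
except ValueError as e:
    print("log score failed:", e)  # math domain error: log2(0)
```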

In terms of punishing 0/1 calls, normally when I use log scores I'm trying to distinguish between two hypotheses. If something would 100% definitely be true under a given hypothesis, and I observe that it's false, I can reject that hypothesis, pack up and go home, so that score going to infinity is a feature not a bug.

For judging "how good is this person at predicting outcomes" in the real world, I guess things are less clear. Log scores punish overconfidence really heavily compared to hedging bets, and while I generally view that as a really valuable intellectual discipline I can believe that there might be applications where it wasn't what you wanted.

But I still view "if we imagine all our observations as one big observation, Alice scores higher than Bob iff she assigned a higher probability to that observation" as a really strong argument in favour of log scores over anything else for most purposes.

Sorry about the late response! Thank you for the explanation, and yes those are some good points.

The log score is the unique monotonic additive score (up to the choice of base), which means that in some sense it's the "objectively correct" way of scoring. It's the score you need if you want to apply Bayes' theorem.

By monotonic, I mean that if Alice assigned a higher probability to the observed outcome than Bob then she should score higher – or, to put it another way, if we're trying to distinguish between hypothesis A and hypothesis B then if we observe an outcome that's more likely under A than under B then we should adjust our beliefs in favour of hypothesis A.

By additive, I mean that scoring two independent events separately should give the same outcome as scoring them together. If we're flipping two coins we can either give a score for each flip, or we can treat it as one observation with four possible outcomes and assign a score for each of those, and we want those to have the same result.

Putting those together, we want score(p1p2) = score(p1) + score(p2), and the only score function with that property is the log score.
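A quick numeric check of that additivity claim, using the negative-log-base-2 convention from elsewhere in this thread, with Brier shown for contrast:

```python
import math

log_score = lambda p: -math.log2(p)   # score for the observed outcome
brier = lambda p: (1 - p) ** 2

p1, p2 = 0.3, 0.6   # probabilities assigned to two independent observed events

# Log: scoring the joint observation equals the sum of the separate scores.
assert math.isclose(log_score(p1 * p2), log_score(p1) + log_score(p2))

# Brier: it doesn't.
print(brier(p1 * p2), brier(p1) + brier(p2))   # ~0.6724 vs ~0.65
```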

For an example of why I prefer log scores, imagine two bags, each containing two biased coins. The coins in bag 1 come up heads with probabilities 1/8 and 1/8, while the coins in bag 2 come up heads with probabilities 1/2 and 1/2^64. Suppose we flip both coins from a randomly-selected bag and see two heads. The probability of this happening under bag 1 is 2^-6, whereas under bag 2 it's 2^-65. The log scores correctly tell us that bag 1 scored 6 and bag 2 scored 65, so (because we're playing by golf rules, where lower is better) it was almost certainly bag 1. But if we use Brier scores (as I understand them – I'm much more used to log scores), then bag 1 scores (7/8)^2 ≈ 0.77 and bag 2 scores (1/4 + (1 - 2^-64)^2)/2 ≈ 0.625, so the Brier score incorrectly suggests that bag 2 was more likely.
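The arithmetic in that example checks out, computing the scores the way I read the comment: total bits of surprisal for the log score, mean squared error for Brier (function names are mine).

```python
import math

def log_score(ps):
    """Total surprisal, in bits, of seeing heads on every coin (lower is better)."""
    return sum(-math.log2(p) for p in ps)

def brier_score(ps):
    """Mean squared error against the all-heads outcome (lower is better)."""
    return sum((1 - p) ** 2 for p in ps) / len(ps)

bag1 = (1/8, 1/8)
bag2 = (1/2, 2.0 ** -64)

print(log_score(bag1), log_score(bag2))      # 6.0 65.0 -> log favours bag 1
print(brier_score(bag1), brier_score(bag2))  # 0.765625 0.625 -> Brier favours bag 2
```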

And you can construct a similar pathological example for any other score apart from the log.