Can you cheat the Brier score?


The Brier score is a common way of judging probabilistic forecasts. If you have several people or teams each giving probabilities to various events, you can judge how well they are doing by comparing their Brier scores: lower scores correspond to more accurate predictions.

However, there is a critical assumption behind such comparisons: that all forecasters compete on the same set of events. Some questions are easier to predict than others. And I’m not talking about domain-specific difficulty, the sort that arises when one wants to predict whether there will be a revolution in Turkey without knowing much about Turkish politics.

It turns out that the difficulty of achieving a certain Brier score is directly related to the actual probability of the event. Intuitively we already understand that: it is easier to “predict” whether a plane will crash (it probably won’t) than to predict whether it will arrive late (who knows?).

Except here we are dealing with probabilistic predictions (not a rigid yes or no) and a specific scoring rule (the Brier score), so our intuitions do not necessarily transfer. And yet the basic result is the same: it is easier to achieve a low (i.e. good) Brier score betting on highly probable and highly improbable events than betting on uncertain events.

What made me look into this was this episode of the Rationally Speaking podcast, where Julia Galef speaks with Anthony Aguirre about his prediction platform Metaculus.

At one point the dialogue goes as follows:

Julia:

Oh, how do you measure how well it works?

Anthony:

[…] So looking over sort of the last half year or so, since December 1st, for example… If you ask for how many predictions was Metaculus on the right side of 50% — above 50% if it happened or below 50% if it didn’t happen — that happens 77 out of 81 times the question resolved, so that’s quite good.

And some of the aficionados will know about Brier scores. That’s sort of the fairly easy to understand way to do it, which is that you assign a zero if something doesn’t happen, and a one if something does happen. Then you take the difference between the predicted probability and that number. So if you predict at 20% and it didn’t happen, you’d take that as a .2, or if it’s 80% and it does happen and that’s also a .2, because it’s a difference between the 80% and a one, and then you square that number.

So Brier scores can run from basically zero to one, where low numbers are good. And if you calculate that for that same set of 80 questions, it’s .072, which is a pretty good score.

Julia:

Yeah, I mean I don’t really have a frame of reference — like, I don’t know what, an average college educated person guessing about these questions would be expected to guess, or to get for their Brier score. But it certainly seems low.

Anthony:

So that’s something that you can sort of … So, some references are a .25 is what you would get if you guessed 50% for everything. So that’s sort of totally uncertain. A .33 is what you would get if you just randomly assigned a probability between zero and one for everything. It’s slightly different things.

If you pay attention, there’s already a hint of the problem in this short exchange. You can get a Brier score of .25 quite easily by always predicting 50%. But if the true probability is indeed 50%, then predicting 50% is also the best you can do: any other prediction will get you an even worse score on average.
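The reference points from the transcript can be checked with a few lines of Python. This is only a sketch: the helper names `expected_brier` and `uniform_guess_brier` are mine, and the "random guess" case is approximated by averaging over an evenly spaced grid of forecasts rather than by actual random draws.

```python
def expected_brier(p, q):
    """Expected Brier score of forecast q for an event with true probability p."""
    # With probability p the event happens (penalty (1 - q)^2),
    # otherwise it does not happen (penalty q^2).
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def uniform_guess_brier(p, steps=10_000):
    """Expected score when q is uniform on [0, 1], via a midpoint-rule average."""
    return sum(expected_brier(p, (i + 0.5) / steps) for i in range(steps)) / steps


# Both reference scores hold whatever the true probability is.
for p in (0.1, 0.5, 0.9):
    print(f"p={p}: always 50% -> {expected_brier(p, 0.5):.3f}, "
          f"uniform guess -> {uniform_guess_brier(p):.3f}")

# When the true probability is 0.5, moving away from 50% only hurts.
for q in (0.3, 0.5, 0.7):
    print(f"true p=0.5, forecast q={q}: {expected_brier(0.5, q):.3f}")
```

The first loop shows that 0.25 and 0.33 are available to anyone on any question set; the second shows that on a genuine coin flip, 0.25 is also the floor.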

Say I set up a platform analogous to Metaculus—I’ll call it Metaculus Prime—and want to beat Metaculus’s Brier score. All I’d need to do is to formulate slightly different versions of Metaculus questions. Currently, the #1 question on Metaculus is this: “Will SpaceX land people on Mars prior to 2030?” On Metaculus Prime, I’ll also allow betting on SpaceX landing on Mars, except I’ll ask “Will SpaceX land people on Mars prior to 2020?”. Because the answer to this modified question is more certain, a person of the same predictive skill would be expected to score better on it than on the original, less certain question.

When predicting the more certain events, one still has an incentive to get the exact probability right (to the extent such a thing exists). But for a given difference between the true and predicted probabilities, there is an extra penalty for probabilities close to 0.5, as I demonstrate below.
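A small simulation makes the point concrete. It is a sketch under assumptions of my own: two hypothetical questions with true probabilities 50% and 5%, and a forecaster who is off by the same 5 percentage points on both. The function name `simulated_brier` and the sample size are arbitrary choices.

```python
import random


def simulated_brier(p, q, n=200_000, seed=0):
    """Average Brier score of a fixed forecast q over n simulated events of probability p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n):
        outcome = 1 if rng.random() < p else 0
        total += (outcome - q) ** 2
    return total / n


# The same 5-point error on an uncertain and on a lopsided question.
uncertain = simulated_brier(p=0.50, q=0.55)
lopsided = simulated_brier(p=0.05, q=0.10)
print(f"true 50%, forecast 55%: {uncertain:.3f}")
print(f"true  5%, forecast 10%: {lopsided:.3f}")
```

The forecaster makes the same mistake in both cases, yet the average score on the lopsided question comes out several times lower.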

To be clear, this is not to criticize Anthony—it was really interesting for me to find out Metaculus’s consensus Brier score, and it doesn’t look at all like he’s manipulating the questions to get a low Brier score. But we should also recognize that the Brier score is not a valid metric to compare forecasters who compete on different questions or different platforms.

It is also interesting to think about how we should design the prediction questions. In a question like “Will the number of electric cars in the world be greater than \(X\) by the end of 2020”, how should we pick \(X\)? If \(X\) is very high or very low, the outcome will be obvious, and there will also be an incentive to make more extreme predictions (closer to 0 or 1 than warranted) since you’re unlikely to be proven wrong in the short term.

On the other hand, pick \(X\) “right in the middle”, and the best Brier score anyone can achieve on average becomes the unimpressive 0.25.

Mathematical details

Consider an event \(X\) that may happen \((X=1)\) with probability \(p\) and not happen \((X=0)\) with probability \(1-p\). Say you predict that \(X\) will happen with probability \(q\). Then the Brier score is calculated as \((X-q)^2\). For multiple events, take the average of the individual Brier scores.
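In code, the definition above might look like the following minimal sketch. The function name `brier_score` and its list-based interface are my own choices, not part of any particular library.

```python
def brier_score(forecasts, outcomes):
    """Mean of (X - q)^2 over paired forecasts q and outcomes X (0 or 1)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((x - q) ** 2 for q, x in zip(forecasts, outcomes)) / len(forecasts)


# The two cases from the transcript: 20% on something that did not happen,
# and 80% on something that did. Each contributes 0.2 squared.
print(brier_score([0.2, 0.8], [0, 1]))
```

Both forecasts are off by 0.2, so the average comes out at approximately 0.04.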

The expectation of \((X-q)^2\) is

\[ \operatorname{E}[(X-q)^2] = (p-q)^2+\operatorname{Var}(X). \]
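For readers who want the intermediate steps, the identity follows from adding and subtracting \(p=\operatorname{E}[X]\) inside the square:

\[\begin{aligned} \operatorname{E}[(X-q)^2] &= \operatorname{E}\big[((X-p)+(p-q))^2\big] \\ &= \operatorname{E}[(X-p)^2] + 2(p-q)\operatorname{E}[X-p] + (p-q)^2 \\ &= \operatorname{Var}(X) + (p-q)^2, \end{aligned}\]

where the middle term vanishes because \(\operatorname{E}[X-p]=0\), and \(\operatorname{E}[(X-p)^2]\) is \(\operatorname{Var}(X)\) by definition.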

For a given event \(X\), \(\operatorname{Var}(X)\) is a constant that does not depend on your forecast, so to minimize the expected Brier score you should pick \(q=p\).

But if you have a choice of which events \(X\) to bet on, it also makes sense to try and minimize the second term, \(\operatorname{Var}(X)\). The variance of a binary (“Bernoulli”) variable \(X\) is given by \(\operatorname{Var}(X)=p(1-p)\). It is maximal when \(p=0.5\) and falls as \(p\) moves towards 0 or 1.

The variance of a 0-1 variable \(X\) as a function of its mean (the probability of 1) \(p\)
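The same curve can be tabulated directly. In the sketch below, `brier_floor` is my name for \(p(1-p)\), the expected score of a forecaster who always predicts the true probability. The helper `certainty_for_floor` inverts it, under the strong and purely illustrative assumption that every question has the same true probability.

```python
import math


def brier_floor(p):
    """Best achievable expected Brier score for an event of probability p."""
    return p * (1 - p)


def certainty_for_floor(score):
    """Smaller root of p * (1 - p) = score; the other root is 1 minus this value."""
    if not 0 <= score <= 0.25:
        raise ValueError("a perfect forecaster's expected score lies in [0, 0.25]")
    return (1 - math.sqrt(1 - 4 * score)) / 2


for p in (0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
    print(f"p={p:<4}  best expected Brier score = {brier_floor(p):.4f}")

# Illustration only: how lopsided would identical questions have to be
# for a perfectly calibrated forecaster to average 0.072?
p_low = certainty_for_floor(0.072)
print(f"p = {p_low:.3f} or {1 - p_low:.3f}")
```

A real question set mixes many different probabilities, so the last line says nothing about any actual platform's questions. It only shows how quickly the floor drops once questions move away from 50%.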

Additional considerations

There are two other considerations that exacerbate this effect. The first is that a given absolute error in probabilities (say, 5 percentage points) is actually more severe for probabilities closer to 0 or 1, so if anything, errors should be penalized more for highly certain events, not less. This can be seen, for instance, on the log-odds scale: the difference between 0.6 and 0.65 is only 0.21, while the difference between 0.9 and 0.95 is 0.75.
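The log-odds figures are easy to reproduce. A minimal sketch, using the natural logarithm; the helper name `log_odds` is mine.

```python
import math


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


# The same 5-point gap, measured on the log-odds scale.
mid_gap = log_odds(0.65) - log_odds(0.60)
tail_gap = log_odds(0.95) - log_odds(0.90)
print(f"0.60 -> 0.65: {mid_gap:.2f}")
print(f"0.90 -> 0.95: {tail_gap:.2f}")
```

On this scale, the gap near the tail is roughly three and a half times the gap near the middle.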

The other consideration is that if I tend to assign a probability of 1 to events whose true probability is 99% (a serious error, if you ask me), it will take quite a few predictions to prove me wrong. In the short term, I will actually be rewarded for making more “certain” forecasts about highly probable or highly improbable events.
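To put rough numbers on "quite a few", here is a sketch that assumes independent events, each with a true probability of exactly 99%. Both function names are hypothetical.

```python
def chance_of_no_miss(p, n):
    """Probability that n independent events of probability p all happen."""
    return p ** n


def questions_until_miss_is_likely(p, threshold=0.5):
    """Smallest n at which at least one miss has probability >= threshold."""
    n = 1
    while 1 - chance_of_no_miss(p, n) < threshold:
        n += 1
    return n


p = 0.99
print(f"chance of a spotless record after 50 questions: {chance_of_no_miss(p, 50):.2f}")
print(f"questions needed before a miss is more likely than not: "
      f"{questions_until_miss_is_likely(p)}")

# Per-question scores while every event keeps happening.
print(f"forecasting 100%: {(1 - 1.0) ** 2:.4f}")
print(f"forecasting  99%: {(1 - p) ** 2:.4f}")
```

Until the first miss arrives, the overconfident forecaster scores a perfect 0 against the honest forecaster's 0.0001. In the long run the ordering flips, but only slightly: the expected scores are 0.01 and 0.0099 respectively.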

Most of this article assumes that there exists some inherent uncertainty in the predicted event. The alternative view is that most events (barring quantum experiments and chaotic systems) can be predicted exactly by a skilled enough team given enough resources. While in this case it doesn’t make sense to talk about the true probability of an event, the fact that different events may be of different difficulty becomes even more obvious.