Thanks Mike, I would be interested to hear why you don't think ROC curves are an appropriate metric for assessing predictive competency. I watched the Philip Tetlock lecture and came out thinking that they would be.

Hey Remy, ok good question. I'd like to make a longer methodology post eventually and cover the pros and cons of different evaluation metrics, but for now I'll try to give a short explanation of why I don't think ROC curves are good for probabilistic forecasts.

So let's say there are 6 races I'm trying to predict. I think that the Democrats will win all of them, but for the first 3 I'm extremely confident that the Democrats will win, and for the last 3 I'm only moderately confident.

So my forecast probabilities for the Democrats winning each race are:

Race 1: 0.95

Race 2: 0.95

Race 3: 0.95

Race 4: 0.7

Race 5: 0.7

Race 6: 0.7

Then the election happens, and the Democrats win the first 3 races (we'll call this outcome 1 for the binary classification), and the Republicans win the last 3 races (we'll call this outcome 0):

Race 1: Democrats win (1)

Race 2: Democrats win (1)

Race 3: Democrats win (1)

Race 4: Republicans win (0)

Race 5: Republicans win (0)

Race 6: Republicans win (0)

Since ROC curve evaluations are based on setting thresholds to measure the tradeoffs between the true positive rate and the false positive rate, I'll get a perfect ROC curve (area under the curve = 1) for my predictions. This is because it's possible to set a threshold that correctly separates all the Democrat wins from Republican wins, with a true positive rate of 1 and a false positive rate of 0. This threshold could be 0.8, for example, or any other number between 0.7 and 0.95.
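As a quick sanity check on the example above, here's a minimal Python sketch that computes the area under the ROC curve straight from its pairwise-ranking definition: the fraction of (Democratic win, Republican win) pairs where the Democratic win got the higher forecast. The function name `roc_auc` and the `squashed` forecast are made up for illustration; this isn't the code used for the actual analysis.

```python
def roc_auc(probs, outcomes):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [p for p, y in zip(probs, outcomes) if y == 1]
    negatives = [p for p, y in zip(probs, outcomes) if y == 0]
    score = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                score += 1.0
            elif pos == neg:
                score += 0.5
    return score / (len(positives) * len(negatives))


# Forecasts and outcomes from the six-race example.
probs = [0.95, 0.95, 0.95, 0.7, 0.7, 0.7]
outcomes = [1, 1, 1, 0, 0, 0]

# Hypothetical forecast with the same ordering but very different probabilities.
squashed = [0.51, 0.51, 0.51, 0.49, 0.49, 0.49]

auc = roc_auc(probs, outcomes)
squashed_auc = roc_auc(squashed, outcomes)
print(auc)           # 1.0
print(squashed_auc)  # 1.0
```

Both forecasts get an AUC of 1.0, because only the ordering of the probabilities enters the calculation.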

But even though this forecast earns me a perfect ROC curve... it really shouldn't. After all, I predicted with >0.5 probability that the Democrats would win all of the races, and they lost half of them.

So that's why ROC curves don't exactly make sense for probabilistic forecasts, where the probabilities aren't just arbitrary weights but actually mean something.
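One way to see the difference is to score the same six forecasts with a metric that does use the probability values. The sketch below uses the Brier score (mean squared error between forecast and outcome, where 0 is perfect and a constant 0.5 forecast gets 0.25). That choice is an assumption for illustration, not necessarily the metric used in the post, and `brier_score`, `sharper`, and `coin_flip` are hypothetical names.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Forecasts and outcomes from the six-race example.
probs = [0.95, 0.95, 0.95, 0.7, 0.7, 0.7]
outcomes = [1, 1, 1, 0, 0, 0]

# Hypothetical comparison forecasts (not from the post).
sharper = [0.95, 0.95, 0.95, 0.05, 0.05, 0.05]  # same ranking, right on every race
coin_flip = [0.5] * 6                           # no information at all

example_brier = brier_score(probs, outcomes)
sharper_brier = brier_score(sharper, outcomes)
coin_flip_brier = brier_score(coin_flip, outcomes)

print(round(example_brier, 5))  # 0.24625
print(round(sharper_brier, 5))  # 0.0025
print(coin_flip_brier)          # 0.25
```

The example forecast and the `sharper` one rank the races identically, so a ROC curve can't tell them apart, but the Brier score puts the example forecast barely ahead of a coin flip.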

I didn't notice this at first, because I was just thinking of it as a binary classification problem, but when I did the preliminary analysis (before some of the later races were called), this scenario actually happened. A couple forecasts actually had perfect ROC curves, and I was like "Wait that doesn't make sense, none of the forecasts were perfect." I thought it was some error in my code, and then I realized this thing about how the ROC curves don't really make sense for this type of thing.

With that being said, I'm actually having second thoughts about whether ROC curves are useful. This flaw with using them on probabilistic forecasts is a real problem, but I also realized that calibration plots are kinda flawed too, since they rely on arbitrary binning and can be misleading if the samples aren't spread evenly across the bins. So now I think that ROC curves and regular calibration plots are both kinda flawed for this topic, and each has its strengths and weaknesses.
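To illustrate the binning issue, here's a rough sketch that builds the table behind a calibration plot for the six-race example under two different bin layouts. The helper `calibration_table` and the bin edges are assumptions chosen for illustration, not the binning used in the post.

```python
from bisect import bisect_right


def calibration_table(probs, outcomes, edges):
    """Return (bin_low, bin_high, count, mean_forecast, observed_freq) per non-empty bin."""
    n_bins = len(edges) - 1
    grouped = {}
    for p, y in zip(probs, outcomes):
        # Bin i covers [edges[i], edges[i + 1]); a forecast of 1.0 goes in the last bin.
        idx = min(bisect_right(edges, p) - 1, n_bins - 1)
        grouped.setdefault(idx, []).append((p, y))
    rows = []
    for idx in sorted(grouped):
        members = grouped[idx]
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        rows.append((edges[idx], edges[idx + 1], len(members), mean_forecast, observed_freq))
    return rows


# Forecasts and outcomes from the six-race example.
probs = [0.95, 0.95, 0.95, 0.7, 0.7, 0.7]
outcomes = [1, 1, 1, 0, 0, 0]

two_bins = calibration_table(probs, outcomes, [0.0, 0.5, 1.0])
ten_bins = calibration_table(probs, outcomes, [i / 10 for i in range(11)])

for row in two_bins + ten_bins:
    print(row)
```

With two bins, all six forecasts land in one bin with a mean forecast of about 0.825 and an observed frequency of 0.5. With ten bins, the 0.7 forecasts show an observed frequency of 0 and the 0.95 forecasts show 1, each based on only 3 samples, so the picture changes a lot depending on the bins.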

Anyway, I hope this explanation makes sense! If not, I'm planning to write a longer methodology post at some point going more in depth on the strengths and weaknesses of different metrics.

Thanks for the comment and for reading my post! 🙂

Thanks Mike for the response. I'd have to look further into what Tetlock is doing, but it seems like when using ROCs they might set the threshold at .5, assuming that for every question forecasting .5 is tantamount to saying the outcomes are equally likely. So the estimates for races 4-6 should count as false alarms. From https://www.edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-ii:

"This is a receiver operating characteristic curve. The blue function rising up here shows how much better superforecasters were at achieving hits at an acceptable cost in false alarms. If you had an equal base rate of events occurring and not occurring, a chance performance would be a straight diagonal across here."

Like with calibration plots, I guess the most important thing is having enough samples to get meaningful output. I look forward to your longer methodology post on this!
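For what it's worth, here's a small sketch of what a fixed 0.5 threshold would give on Mike's six-race example, compared with a threshold of 0.8. Whether Tetlock's analysis actually fixes the threshold this way is an assumption, not something confirmed above, and `rates_at_threshold` is a hypothetical helper.

```python
def rates_at_threshold(probs, outcomes, threshold):
    """Return (true positive rate, false positive rate) when forecasts >= threshold count as 'yes'."""
    true_pos = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    return true_pos / n_pos, false_pos / n_neg


# Forecasts and outcomes from the six-race example.
probs = [0.95, 0.95, 0.95, 0.7, 0.7, 0.7]
outcomes = [1, 1, 1, 0, 0, 0]

at_half = rates_at_threshold(probs, outcomes, 0.5)
at_eight_tenths = rates_at_threshold(probs, outcomes, 0.8)

print(at_half)          # (1.0, 1.0)
print(at_eight_tenths)  # (1.0, 0.0)
```

At 0.5 every race gets called for the Democrats, so the three Republican wins are all false alarms (false positive rate of 1). At 0.8 the same forecasts separate perfectly, which is the point on the curve that produces the perfect AUC.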

Another analysis of midterm forecasts from First Sigma, which includes more sites (including Polymarket and Metaculus), but fewer forecast questions:

https://firstsigma.substack.com/p/midterm-elections-forecast-comparison

Very interesting post to check out!

Thanks for writing this. PredictIt skewed Republican across the board, at least relative to 538.

For the election, I just went through the PredictIt markets and bet on those showing high discrepancies relative to 538 (e.g., 538 gave around 15% greater odds to Democrats winning the NV Senate race relative to PredictIt). Made a tidy profit.

I think PredictIt has issues with very unlikely outcomes, inflating things in the 1-3% range to 5-10%. This is roughly evident in your spreadsheet. There is some strategy to this. I occasionally buy shares even if I believe they are correctly priced, reasoning: "This 4c share is at its floor. It can't go lower than this, and even if it doesn't pan out, I can probably sell it at the same price in a few weeks."

Either way, even if you only look at markets that PredictIt priced between 11c and 89c, 538 would've certainly done better, given that it leaned more Democratic overall.

Thanks! Yeah I agree, there were a lot of races that FiveThirtyEight had 0.99 - 0.01, and PredictIt seems to have them like 0.9 - 0.1 or even 0.8 - 0.2. If I remember correctly, the NY governor race was one of these. PredictIt gave the Democrats like a 75% chance to win, and FiveThirtyEight gave them like a 97% chance.