For some time now I've opined that I don't like accuracy as a forecaster performance metric because it's "too lumpy", and that I prefer, instead, mean absolute error.
Opinions can be interesting, but they're far less useful than the facts that emerge from a realistic analysis (though Inside Out taught us how easily facts and opinions get mixed up).
So, I thought it was time I brought some facts to the table.
TERMS AND METHODOLOGY
Firstly, let's define our metrics:
- Accuracy is the percentage of games in which the forecaster selects the correct winning team. In games that are drawn we credit the forecaster with half a win. For a given season, accuracy ranges between 0% and 100%.
(NB Because our forecasters will make continuous forecasts there's no need to account for the vanishingly small probability that a forecaster will tip a draw.)
- Mean Absolute Error (MAE) is the average absolute difference, in points, between a forecaster's predicted margin and the actual game margin. Lower values are better.
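As a concrete sketch of these two definitions, here's how they might be computed in Python. Margins are expressed from the favourite's viewpoint, so a correct tip is a predicted and actual margin with the same sign; the function names are mine, not anything standard:

```python
import numpy as np

def accuracy(predicted_margins, actual_margins):
    """Share of games in which the forecaster tipped the winner.

    A correct tip is a predicted and actual margin with the same sign.
    Drawn games (actual margin of 0) earn half a win.
    """
    predicted = np.asarray(predicted_margins, dtype=float)
    actual = np.asarray(actual_margins, dtype=float)
    draws = actual == 0
    correct = (np.sign(predicted) == np.sign(actual)) & ~draws
    return (np.sum(correct) + 0.5 * np.sum(draws)) / len(actual)

def mae(predicted_margins, actual_margins):
    """Mean absolute error of the predicted margins, in points."""
    return float(np.mean(np.abs(np.asarray(predicted_margins, dtype=float)
                                - np.asarray(actual_margins, dtype=float))))
```

For example, a forecaster who predicts margins of (3, -5, 2, 4) against actual results of (10, -1, -6, 0) gets two correct tips, one incorrect tip, and half a win for the draw, for an accuracy of 62.5%.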
For the purposes of this analysis, we'll assume that seasons are 200 games long and that there are three season types differentiated by the profile of the expected margins for the pre-game favourites, as shown in the table at right.
In a Regular season, for example, there will be exactly 20 games with an expected victory margin for the favourite of 3 points, 30 games with an expected victory margin for the favourite of 9 points, and so on. The profile for the Regular season is roughly based on bookmaker pre-game margins for the 2016 and 2017 seasons combined. We could, instead, define seasons where every expected margin between (say) 0 and 100 points was possible but, as we'll see, the simplification made here is not material.
Competitive seasons see more games expected to be close and fewer games expected to be blowouts, while Blowout seasons see the opposite.
We will simulate 100,000 seasons of each type, generating actual results for each game in a season by assuming that it is drawn from a Normal distribution with a mean equal to the expected margin, and a standard deviation of 36 points. That's roughly consistent with empirical evidence. We'll round the numbers that are drawn in this way to make them integers (though, now I think about it, continuous scoring would do away with those troublesome drawn games that so irk some people).
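The game-result generation described above might be sketched as follows. Note that only the 3-point and 9-point counts in the profile below come from the article's Regular-season table; the remaining counts are placeholders I've chosen so the season sums to 200 games:

```python
import numpy as np

rng = np.random.default_rng(20240601)

# Expected favourite margin -> number of games at that margin.
# Only the 3- and 9-point counts are from the article; the rest are
# illustrative placeholders that bring the season to 200 games.
REGULAR_PROFILE = {3: 20, 9: 30, 15: 35, 21: 30, 27: 25,
                   33: 20, 39: 15, 45: 10, 51: 10, 57: 5}

def simulate_season(profile, sd=36, rng=rng):
    """Draw one actual margin per game from N(expected margin, sd),
    rounded to an integer (so draws remain possible)."""
    means = np.repeat(list(profile.keys()), list(profile.values()))
    return np.rint(rng.normal(means, sd)).astype(int)
```

Repeating `simulate_season` 100,000 times, once per simulated season, gives the sampling distribution of season outcomes used in the comparisons that follow.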
Into this world we'll bring 10 tipsters, all of them unbiased in that their average predicted margins equal the true pre-game margin, but with different standard deviations around that mean.
The best tipster's forecasts will be drawn from a Normal distribution with a mean equal to the true pre-game mean and a standard deviation of 5 points. For the remaining nine tipsters, the second-best tipster will have a standard deviation of 6 points, the third-best tipster a standard deviation of 7 points, and so on, up to the worst tipster whose standard deviation will be 14 points.
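The tipster setup described in the two paragraphs above can be sketched like this (a minimal illustration, assuming the forecasts for all games in a season are drawn independently):

```python
import numpy as np

rng = np.random.default_rng(7)

def tipster_forecasts(true_means, sds=range(5, 15), rng=rng):
    """One row of forecasts per tipster.

    Every tipster is unbiased (centred on the true pre-game mean) but
    has a different standard deviation: 5 points for the best tipster,
    rising in 1-point steps to 14 points for the worst.
    """
    true_means = np.asarray(true_means, dtype=float)
    return np.array([rng.normal(true_means, sd) for sd in sds])
```

Calling `tipster_forecasts` on the 200 expected margins for a season yields a 10-by-200 array of continuous predicted margins, one row per tipster.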
There is, I think it's fair to say, a natural ordering of these tipsters based on the size of their standard deviation. In a Regular Season, on average, they'll produce MAEs ranging from 29.0 points per game for the best tipster, which would, historically, be highly competitive, to 30.8 points per game for the worst tipster.
The question is: how well will the results in a 200-game season using the accuracy or the MAE metric match that natural ordering? Or, put more simply, how often will the best tipster be found to be best, and the worst found to be worst?
The results for the 100,000 Regular seasons are shown below and reveal that the best tipster is ranked first on MAE in about 30% of seasons, which is about 3 times the level of chance. It finishes in the top 3 over 75% of the time.
The worst tipster finishes last about 43% of the time (and first a slightly startling 1% of the time).
Using the Accuracy metric instead we find that the best tipster is ranked first only 20% of the time (or two times chance) and finishes top 3 less than 60% of the time.
Note that it is possible for tipsters to tie on the Accuracy metric. When this occurs, all tying tipsters are given the highest rank. So, some of the first places recorded by the best tipster in the table above will, in fact, be ties for first.
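The tie-handling rule just described is standard "competition ranking" (1, 2, 2, 4, ...), where tying tipsters all receive the best of the tied ranks. A small sketch, in plain NumPy:

```python
import numpy as np

def competition_ranks(scores, higher_is_better=True):
    """Rank scores 1..n, with ties sharing the best (smallest) rank.

    Use higher_is_better=True for Accuracy and False for MAE, where a
    lower score is the better one.
    """
    scores = np.asarray(scores, dtype=float)
    if higher_is_better:
        scores = -scores
    # A score's rank is 1 plus the number of strictly better scores.
    return np.array([1 + np.sum(scores < s) for s in scores])
```

So three tipsters with accuracies of 60%, 60%, and 55% would be ranked 1, 1, and 3: both leaders tie for first, and no tipster is ranked second.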
As well, the worst tipster avoids last place over 75% of the time, and rises to the top of the list about 1 time in 16 seasons.
So, while neither metric guarantees that the cream always rises to the top and the dregs to the bottom, MAE does a better job at sorting the wheat from the chaff than does Accuracy (though neither metric prevents mixed metaphors from being used in its comparison).
What difference, if any, does it make if there are more close games - and hence more possible "upsets" - in a season?
To explore this we can look at the results for the Competitive season profile, which appear below.
We see virtually no difference for the results using MAE, which is what we should expect given that in moving from a Regular to a Competitive season all we're doing is altering the profile of game means and not the spread of forecasts around those means for the 10 tipsters.
As it turns out, there's not a lot of difference for the results using Accuracy either, so the overall superiority of MAE over Accuracy remains about the same whether we have a Regular or a Competitive season as defined here.
For completeness, we look also at the results for the Blowout season profile. These appear below and are consistent with the results for both the Regular and the Competitive season profiles.
For interest's sake (and, to check that our simulations are behaving the way they should), let's look at the results for a Coin-Toss season - that is, one in which every game has an expected margin of zero. Those results appear below.
We find that we again get much the same results for the MAE metric, but that, on the Accuracy metric, all tipsters are equally likely to finish ranked in any of the 10 places - as we'd expect, since every game is now effectively a coin toss. Put another way, in a Coin-Toss season, being closer, on average, to the true expected margin of zero provides no utility when the performance metric is Accuracy.
Clearly though, you have to skew the season profile a long way towards the Coin-Toss profile before this phenomenon is discernible - our Competitive season profile gets us nowhere near this result.
To the extent that the simulation environment I've defined is a reasonable analogy for the real world, I can now put some numbers around the genuine superiority of MAE over Accuracy as a forecaster performance metric in a "regular" 200-game season.
What's more, that superiority seems to be of roughly the same magnitude if we assume that a season is slightly more or considerably less competitive than a "regular" season.
Neither metric is perfect, and sometimes the first shall be last, and the last shall be first under either of them, but this will happen less often if we use MAE rather than Accuracy.