I've spoken to quite a few fellow modellers about the process of creating and optimising models for forecasting the results of AFL games. Often, the topic of which performance metric to optimise arises.
Two metrics consistently come up as candidates:
- Accuracy, or the percentage of forecasts that predict the correct winner
- Mean Absolute Error (MAE), or the average difference between the actual and forecast margin
Mean Squared Error (MSE) comes up too, but it has always seemed a poor choice to me in an AFL context because of the heavy weight it puts on blowout results. In any case, the arguments for Accuracy vs MAE as performance metrics apply equally well to Accuracy vs MSE.
Today I want to explore, by simulation, the relative efficacy of the Accuracy and MAE metrics specifically in an AFL forecasting context, though the results do generalise.
To do this I'm going to assume that, for what I'll call a 'typical' season:
- The Expected Home team margin in a game can be modelled as a Normal variable with mean 5.7 and SD of 29.8, which is consistent with the distribution of bookmaker handicaps over the past decade.
- The Actual Home team margin in a game can be modelled as a Normal variable with mean equal to the Expected Home team margin and an SD of 36 points (which is broadly consistent with what we've found, empirically, before)
- There are 200 games in a season, for each of which an Expected Home team margin is drawn independently and then this Expected Home team margin used to generate an Actual Home team margin
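Under these assumptions, a single season is straightforward to simulate. A minimal sketch in Python with NumPy (the variable names and the seed are my own choices, not part of any particular modelling framework):

```python
import numpy as np

rng = np.random.default_rng(20240601)   # arbitrary seed, for reproducibility

N_GAMES = 200
EXPECTED_MEAN, EXPECTED_SD = 5.7, 29.8  # distribution of Expected Home margins
ACTUAL_SD = 36.0                        # spread of Actual margins around Expected

# One 'typical' season: draw an Expected Home margin for each game,
# then use it as the mean when drawing that game's Actual Home margin
expected = rng.normal(EXPECTED_MEAN, EXPECTED_SD, N_GAMES)
actual = rng.normal(expected, ACTUAL_SD)
```

Passing the `expected` array as the `loc` argument gives each game its own mean, which is exactly the two-stage draw described above.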
I'll place four different tipsters into this environment:
- A 'perfect' tipster, whose forecasts are always equal to the actual Expected Home team margin in a game
- A 'good' tipster, whose forecasts are drawn from a Normal distribution with mean equal to the actual Expected Home team margin in a game with an SD of 6 points. So, on average, this tipster gives forecast margins that are equal to the true Expected Home team margin, but she does so with some random variability around that value
- A 'poor' tipster, whose forecasts are drawn from a Normal distribution with mean equal to the actual Expected Home team margin in a game plus a fixed bias of +3 points and with an SD of 12 points. On average this tipster gives forecast margins that are 3 points higher than the true Expected Home team margin, and with a little more variability around that average than the good tipster
- A 'very poor' tipster, whose forecasts are drawn from a Normal distribution with mean equal to the actual Expected Home team margin in a game plus a fixed bias of +6 points and with an SD of 18 points. On average this tipster gives forecast margins that are 6 points higher than the true Expected Home team margin, and with quite a lot of variability around that average
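Each tipster can then be generated from the true Expected margins by a (bias, noise SD) pair, with the perfect tipster as the zero-bias, zero-noise special case. A sketch, again with NumPy (the dict parameterisation is my own convenience, and NumPy happily accepts a scale of zero):

```python
import numpy as np

rng = np.random.default_rng(0)

# (fixed bias, SD of forecast noise) for each tipster
TIPSTERS = {
    'perfect':   (0.0,  0.0),
    'good':      (0.0,  6.0),
    'poor':      (3.0, 12.0),
    'very poor': (6.0, 18.0),
}

def tip(expected, bias, sd, rng):
    """Forecast margins: the Expected Home margin plus a fixed bias and noise."""
    return rng.normal(expected + bias, sd)  # sd=0 returns expected + bias exactly

expected = rng.normal(5.7, 29.8, 200)  # Expected Home margins for one season
forecasts = {name: tip(expected, b, sd, rng) for name, (b, sd) in TIPSTERS.items()}
```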
These four tipsters can readily and objectively be ranked: perfect, good, poor, very poor.
What we'll investigate using the simulations is how often each tipster finishes in its correct placing across 10,000 replicates of the season described. The results appear below.
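For concreteness, here is one way the whole experiment might be coded. This is my own reconstruction of the procedure described above, not the original code, and for brevity it tallies only outright 1st finishes, whereas the results discussed also track ties:

```python
import numpy as np

rng = np.random.default_rng(7)
N_GAMES, N_REPS = 200, 10_000
TIPSTERS = {'perfect': (0.0, 0.0), 'good': (0.0, 6.0),
            'poor': (3.0, 12.0), 'very poor': (6.0, 18.0)}

outright_first_acc = dict.fromkeys(TIPSTERS, 0)
outright_first_mae = dict.fromkeys(TIPSTERS, 0)

for _ in range(N_REPS):
    expected = rng.normal(5.7, 29.8, N_GAMES)
    actual = rng.normal(expected, 36.0)
    acc, mae = {}, {}
    for name, (bias, sd) in TIPSTERS.items():
        f = rng.normal(expected + bias, sd)
        acc[name] = np.mean(np.sign(f) == np.sign(actual))  # correct winner tipped
        mae[name] = np.mean(np.abs(f - actual))
    top_acc = [n for n in TIPSTERS if acc[n] == max(acc.values())]
    if len(top_acc) == 1:                      # Accuracy's lumpiness produces ties
        outright_first_acc[top_acc[0]] += 1
    outright_first_mae[min(mae, key=mae.get)] += 1  # MAE ties have probability ~0

for name in TIPSTERS:
    print(f"{name:>10}: Accuracy 1st {outright_first_acc[name] / N_REPS:5.1%}, "
          f"MAE 1st {outright_first_mae[name] / N_REPS:5.1%}")
```

With a forecast margin's sign taken as the tipped winner, a run of this sketch reproduces the qualitative pattern discussed next: the perfect tipster dominates the MAE rankings far more often than the Accuracy rankings.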
On the left we have the results for the Accuracy metric, which by its lumpy nature sometimes produces ties. We see here that the perfect tipster finishes outright 1st across the group of four only about 28% of the time, and finishes tied for 1st slightly less often. In more than one replicate in 8 it finishes 3rd or lower.
The very poor forecaster, meanwhile, finishes top or equal top in almost one replicate in 12.
This is not as it should be.
In contrast, the MAE metric sees the perfect tipster finishing top over 80% of the time, and 2nd in virtually every other replicate. The very poor forecaster finishes third or fourth in all but a handful of replicates.
So, fairly clearly, the MAE metric does a much better job of identifying forecasting ability than does the Accuracy metric.
Lastly, let's look at what happens in what I'll call a 'surprising' season, which is one where the SD of home team margins increases by a goal, from 36 to 42 points.
The results for this scenario appear below.
Greater variability in game outcomes leads to greater divergence from the correct rankings whether we use Accuracy or MAE, but the degradation is smaller for MAE in several respects.
For example, the very poor tipster now finishes 1st or tied for 1st on the Accuracy metric in about an additional 2% of replicates (10.2% vs 8.1%), but does so on the MAE metric in only an additional 0.2% of replicates (0.3% vs 0.1%).
As well, even in this environment of higher variability, the perfect tipster tops the MAE rankings 75% of the time and finishes second in almost all of the rest.
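There's also a quick analytical way to see why higher variability blurs the rankings: a tipster's forecast error is itself Normal, with mean equal to its bias and SD equal to sqrt(s² + σ²), where s is the tipster's noise SD and σ the SD of actual margins about their Expected values, so its long-run MAE is the mean of a folded Normal. A small sketch using the standard folded-Normal formula (this check is my own addition, not part of the simulations):

```python
import math

def expected_mae(bias, tipster_sd, margin_sd):
    """Mean of |N(bias, s)| where s^2 = tipster_sd^2 + margin_sd^2 (folded Normal)."""
    s = math.hypot(tipster_sd, margin_sd)
    cdf = 0.5 * (1 + math.erf(bias / (s * math.sqrt(2))))  # Normal CDF at bias/s
    return (s * math.sqrt(2 / math.pi) * math.exp(-bias**2 / (2 * s**2))
            + bias * (2 * cdf - 1))

TIPSTERS = {'perfect': (0, 0), 'good': (0, 6), 'poor': (3, 12), 'very poor': (6, 18)}
for margin_sd in (36, 42):
    maes = {n: round(expected_mae(b, sd, margin_sd), 2)
            for n, (b, sd) in TIPSTERS.items()}
    print(margin_sd, maes)
```

The perfect tipster's long-run MAE edge over the good tipster is about 0.40 points per game at σ = 36, shrinking to about 0.34 points at σ = 42, while the game-to-game noise around those long-run values grows, so a 200-game season separates the tipsters a little less reliably.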
Though it seems natural to measure a football forecasting model by the number of winners it correctly predicts, the MAE metric provides a far superior measure of a model's underlying forecasting ability.