Consider the following scenario. You're offered a bet in which you can choose to predict the final score of the Home team or of the Away team and your adversary is then required to predict the final score of the other team. Assuming that the winner of the wager is determined on the basis of whichever guess is closest to the actual team score:
- For which team should you decide to predict the final score?
- How might you use the usual MAFL data - team ratings, bookmaker prices and the game's interstate status - to form a prediction?
A first possible clue comes from reviewing the scores for Home and Away teams over the last 12 seasons, summarised in the following chart.
We see that the variability of Home team scores is very marginally greater than that of Away team scores, so our instinct might be to choose to predict the less-variable Away team score.
(In passing, also note that Away teams, on average, score about one-and-a-half goals fewer than their Home team opponents, so we'll need to make sure that we take this difference in mean into account.)
Raw variability is different from explainable variability, however, so unless you plan to guess the mean in every game and hope that your adversary does likewise, knowledge of this raw variability actually doesn't help much. In fact, even this doubly naive strategy would see you win only about 50.8% of games.
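You can see roughly where a figure like that comes from with a quick simulation of the doubly naive strategy. This is a sketch only: the means and standard deviations below are illustrative assumptions, not the actual MAFL figures, and `simulate_naive_strategy` is my own helper name.

```python
import random

random.seed(1)

def simulate_naive_strategy(n_games=200_000,
                            home_mean=98.0, home_sd=27.0,
                            away_mean=89.0, away_sd=26.0):
    """You guess the Away team's mean score every game; your adversary
    guesses the Home team's mean. Scores are drawn from (assumed) normal
    distributions. Returns the fraction of games you win, splitting ties."""
    wins = 0.0
    for _ in range(n_games):
        home = random.gauss(home_mean, home_sd)
        away = random.gauss(away_mean, away_sd)
        your_error = abs(away - away_mean)
        their_error = abs(home - home_mean)
        if your_error < their_error:
            wins += 1
        elif your_error == their_error:
            wins += 0.5
    return wins / n_games

print(simulate_naive_strategy())
```

With the two standard deviations this close together, the win rate for the guesser of the less-variable score lands only fractionally above 50%.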
Time then to fit a model or twelve. For this exercise I've split the games across the 2000-2011 seasons roughly into halves, using one half for fitting the models and the other half for testing them.
Fitting linear models first, we find that the mean absolute prediction error on the holdout sample is 19.77 points per game for the model fitting Home team scores and 20.11 points per game for the model fitting Away team scores. That seems like a fairly convincing win for the Home team model, though it turns out that, on a game-by-game basis, the Home team model's predictions are closer to the actual Home team score than the Away team model's predictions are to the actual Away team score in only 50.4% of holdout contests. That's an edge, but you'd want to play a huge number of games before you could be confident that it was a genuinely profitable one. (Even after 1,000 wagers you've only about a 59% probability of being in front with an edge this small.)
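That parenthetical claim about 1,000 wagers can be checked with a normal approximation to the binomial distribution. A quick sketch, using only the standard library (`prob_ahead` is my own helper name):

```python
from math import erf, sqrt

def prob_ahead(p, n):
    """Probability of winning strictly more than half of n independent
    wagers, each won with probability p, via a continuity-corrected
    normal approximation to the binomial."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    z = (n / 2 + 0.5 - mean) / sd  # need more than n/2 wins to be in front
    return 0.5 * (1 - erf(z / sqrt(2)))

print(round(prob_ahead(0.504, 1000), 3))  # about 0.59
```

So a 50.4% edge over 1,000 games leaves you ahead only about 59% of the time, as claimed.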
These models are, by the way:
- Predicted Home Team Score = 263.05 + 0.21434 * Home MARS Rating - 0.29459 * Home Price - 0.38257 * Away MARS Rating + 1.81324 * Away Price + 1.21864 * Interstate Status
- Predicted Away Team Score = 209.40 - 0.29038 * Home MARS Rating + 0.24404 * Home Price + 0.17276 * Away MARS Rating - 1.22039 * Away Price - 4.36038 * Interstate Status
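Transcribed into code, the two fitted equations look like this. The coefficients are exactly those above; the example inputs (both teams rated 1000, both priced at $1.90, no interstate travel) are my own hypothetical values for an evenly matched game.

```python
def predict_home_score(home_rating, home_price, away_rating, away_price, interstate):
    """Linear model for the Home team score, using the coefficients above."""
    return (263.05 + 0.21434 * home_rating - 0.29459 * home_price
            - 0.38257 * away_rating + 1.81324 * away_price
            + 1.21864 * interstate)

def predict_away_score(home_rating, home_price, away_rating, away_price, interstate):
    """Linear model for the Away team score, using the coefficients above."""
    return (209.40 - 0.29038 * home_rating + 0.24404 * home_price
            + 0.17276 * away_rating - 1.22039 * away_price
            - 4.36038 * interstate)

# Hypothetical evenly matched game: both teams rated 1000, both at $1.90,
# played in the Home team's home state (interstate = 0).
print(round(predict_home_score(1000, 1.90, 1000, 1.90, 0), 1))  # 97.7
print(round(predict_away_score(1000, 1.90, 1000, 1.90, 0), 1))  # 89.9
```

Reassuringly, for this evenly matched game the two predictions differ by about 8 points, consistent with the roughly one-and-a-half-goal gap between average Home and Away scores noted earlier.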
Both models explain only a surprisingly small proportion of the variability of their respective target variable. The model predicting the Home team score explains only 16% of the variability in Home team scores in the holdout games, while the model predicting the Away team score explains just 17% of the variability in Away team scores in those same holdout games. That means more than 80% of the variability in team score is down to factors other than pre-game form and game venue - things such as the prevailing weather conditions and other on-the-day influences.
What About Non-Linear Relationships?
I wondered if a more savvy adversary, cunningly forced by me to predict the Away team score, might do a little better by adopting a non-linear modelling technique, so I also fitted random forests, (unbiased) conditional inference trees, gradient boosted regressions and support vector machines. From experience I know that these modelling approaches, unlike linear models, sometimes benefit from having fewer explanatory variables rather than more, so I fitted additional versions of each model type, each with one of the following dropped:
- Both bookmaker prices
- The Home team bookmaker price only
- The Away team bookmaker price only
- Both teams' MARS Ratings
This approach also allows us to make an assessment of the relative importance of these dropped variables in that the greater the decline in model performance when a variable or variables are dropped, the more important they might be claimed to be in the full model.
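The drop-a-variable-group protocol itself is straightforward to sketch. The version below uses ordinary least squares rather than the non-linear learners, and the data and coefficients are invented purely for illustration; only the evaluation loop (fit on one half, compare holdout MAE with each group of columns removed) reflects the approach described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (invented for illustration): columns are
# Home MARS Rating, Home price, Away MARS Rating, Away price, Interstate.
n = 1000
X = np.column_stack([rng.normal(1000, 30, n), rng.uniform(1.2, 6.0, n),
                     rng.normal(1000, 30, n), rng.uniform(1.2, 6.0, n),
                     rng.integers(0, 2, n)])
# Invented "true" relationship plus noise, standing in for actual scores.
y = 0.2 * X[:, 0] - 0.3 * X[:, 2] + 2.0 * X[:, 3] - 60 + rng.normal(0, 20, n)

train, test = slice(0, 500), slice(500, None)

drop_schemes = {"none": [], "both prices": [1, 3], "home price": [1],
                "away price": [3], "both ratings": [0, 2]}

def holdout_mae(dropped):
    """Fit OLS on the training half without the dropped columns and
    return the mean absolute prediction error on the holdout half."""
    keep = [c for c in range(X.shape[1]) if c not in dropped]
    A = np.column_stack([np.ones(n), X[:, keep]])  # prepend an intercept
    beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    return float(np.abs(A[test] @ beta - y[test]).mean())

for name, cols in drop_schemes.items():
    print(f"dropping {name}: holdout MAE {holdout_mae(cols):.2f}")
```

The bigger the jump in holdout MAE when a group of columns is removed, the more important that group was to the full model, which is exactly the reading applied in the results below.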
The results of all this modelling are in the table below:
In summary, we can say that:
- With the mean absolute prediction error (MAPE) on the holdout as the measure of performance, no model outperforms the linear model in predicting Home team scores, and only the gradient boosted model using a Laplace distribution and dropping the bookmaker's Home team price very narrowly outperforms the linear model in predicting Away team scores. Frankly, it's virtually a draw: 20.06 plays 20.11.
- In predicting Home team scores, dropping both bookmaker prices generally has the largest detrimental effect on the absolute prediction error of the resulting model in the holdout games. Dropping both MARS Ratings is, on average, next most detrimental. If forced to choose between dropping the Home team price and the Away team price, you'd drop the Home team price, since model performance is generally least affected by its loss.
- In predicting Away team scores, dropping both bookmaker prices is most detrimental to model performance about as often as dropping both MARS Ratings. As well, on balance, dropping the Away team bookmaker price is preferred only slightly less often than dropping the Home team bookmaker price.
- Generally, being able to predict within about 20 points of the Home team or of the Away team final score is an aspirational modelling target.
(Largely because I'd already done most of the preparatory work, I decided also to fit the models to the scores expressed as the Home team victory margin. As we've found before, predicting within about 30 points represents the apex of victory margin modelling.
In addition, it turns out that the linear model prevails here too and that, forced to surrender variables, giving up both prices generally hurts most, losing both MARS Ratings hurts next most, and foregoing the bookmaker's Home Team price hurts least.)
Faced with the wagering proposition presented at the start of this blog, your best approach is to opt to predict the Home team's score and to use the linear model presented earlier - and then hope that your adversary doesn't have a ratings system that's superior to MARS.