Bookies, I think MAFL has comprehensively shown, know a lot about football, but just how much more do they know than what you or I might glean from a careful review of each team's recent results and some other fairly basic knowledge about the venues at which games are played?
One of the main ways that a bookmaker's knowledge is distilled is in the head-to-head prices that he or she offers, and it's this expression of their nous that I'll be analysing in the first part of this blog.
Let's start by giving the bookies far better than an even break - let's give them the quintessential strawman to defeat. If bookies know something - anything - about football, then we should be able to predict the outcome of a given football game on the basis of their head-to-head prices better than we could predict it in the absence of any detailed knowledge of the game.
To investigate the truth or otherwise of that claim I need to define 'outcome', which for this blog I'll take to be the margin of victory, and 'better', which I'll define in a minute.
So, let's consider the prediction of the victory margin. Absent any knowledge of football bar the fact that it's played between two teams, the most naive prediction of the result of any contest would be that it finished in a draw - that is, with a victory margin of zero.
How well would this naive approach perform? We could measure this using the same metrics that I've been using for measuring the performance of the MAFL Margin Tipsters, that is, the Mean Absolute Prediction Error and the Median Absolute Prediction Error.
Measured over the home-and-away seasons commencing Round 13 of 1999 (the reason for this apparently odd starting point will become evident later) and ending with Round 22 of 2009, this naive approach to victory margin prediction would yield a Mean APE of 34.6 points per game and a Median APE of 30 points. Our strawman has now been constructed - now to see if the bookies can set it alight.
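The two metrics are simple enough to sketch in code. Here's a minimal Python version, with hypothetical margins standing in for the real results (the function names are mine, not MAFL's):

```python
import statistics

def mean_ape(actual, predicted):
    """Mean Absolute Prediction Error across a set of games."""
    return statistics.mean(abs(a - p) for a, p in zip(actual, predicted))

def median_ape(actual, predicted):
    """Median Absolute Prediction Error across a set of games."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))

margins = [12, -33, 45, -8, 27, -56, 3]   # hypothetical victory margins
naive = [0] * len(margins)                # naive forecast: every game a draw
print(mean_ape(margins, naive))           # with a zero prediction, each APE = |margin|
print(median_ape(margins, naive))
```

Note that, under the naive draw forecast, the absolute prediction error for a game is just the absolute value of its victory margin.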
To see if they can, we first need to convert the bookies' starting prices into predictions of victory margins. (We could, of course, use the line markets to do this, but I'm going to ignore them for now, not least because I only have line market data since 2006.) To make this conversion I've been using a fantastic statistical tool called Eureqa, which has allowed me to consider millions of possible model formulations and identify the handful that perform well.
For the purposes of modelling I've summarised the bookies' head-to-head prices as probabilities. So, if team A is at price $A and team B is at price $B, then Team A's probability of victory is B/(A+B) and Team B's is A/(A+B).
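That conversion is a one-liner; dividing by the sum of the two prices normalises away the bookmaker's overround. A minimal sketch, with hypothetical prices:

```python
def victory_probability(own_price, opp_price):
    """Implied victory probability from a pair of head-to-head prices."""
    return opp_price / (own_price + opp_price)

# Hypothetical market: team A at $1.55, team B at $2.35
prob_a = victory_probability(1.55, 2.35)
prob_b = victory_probability(2.35, 1.55)
print(round(prob_a, 3))  # about 0.603
print(round(prob_b, 3))  # about 0.397
```

By construction the two probabilities sum to exactly 1, which is what makes them usable as model inputs.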
Using these bookie probabilities as input, the model that Eureqa found for me, which I'll call the Bookie Model, was the following:
Predicted Victory Margin = 20.7981*ln(Prob) + 56.8537*Prob^2
where 'Prob' is the team's probability.
So, for example, a team with an estimated 60% probability of victory (ie one priced at around $1.55 to its opponent's $2.35) would be predicted to win by 20.7981*ln(0.6) + 56.8537*0.6^2, which is 9.8 points.
(As a quick confirmation of this model's validity I note that teams priced at $1.55 have usually given starts of 10.5 points on line betting.)
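The Bookie Model is just the printed formula, so it translates directly into code. A quick sketch reproducing the worked example above:

```python
import math

def bookie_margin(prob):
    """Bookie Model: predicted victory margin from a team's implied
    victory probability (coefficients as reported above)."""
    return 20.7981 * math.log(prob) + 56.8537 * prob ** 2

print(round(bookie_margin(0.6), 1))  # 9.8 points, matching the worked example
```

As you'd hope, the predicted margin grows steeply as the probability approaches 1, and is close to zero for an even-money game.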
This model produces a Mean APE of 30.00 points per game over the 1999-2009 timeframe described above and a Median APE of 25.23. That's about a 13% lower Mean APE than we got with the naive approach and a 16% lower Median APE. Bookies 1 - Strawman 0.
Another way we might gauge the performance of this model is to calculate the proportion of the variability in victory margins that can be explained using it (ie the R-squared of its predictions). It's about 25%, which is okay.
Okay, so we know that bookies know more than a naive forecaster does. Time to create a more statistically adept competitor.
What if we created a model using only the following data:
- MAFL MARS Ratings (which are really just a numerical summary of each team's previous performances, weighted by the quality of the opposition it has faced)
- The results of each team's 12 most recent games in the home-and-away season (stretching back into the previous home-and-away season if necessary)
- Information about whether a team is at home or not, and about whether a team or its opponent is playing interstate or not
None of these measures incorporate any of the 'additional' pieces of knowledge that are available to the bookmakers in the days leading up to the game, such as player ins and outs, team dramas, team form and all the other stuff that fills up so much of the sports pages in Victoria during footy season.
Using Eureqa again with the data for the home-and-away games from Round 13 of 1999 (which I've chosen as my starting point because, as I noted above, I'm using each team's most recent 12 results) to the end of 2009, I came up with the following model:
Predicted Victory Margin = 0.091119*Ave_Res_Last_2 + 672.347*Own_MARS - 672.483*Opp_MARS + 13.9791*Interstate_Clash
where:

- Ave_Res_Last_2 is the aggregated and averaged result of the last 2 games (so, for example, if the team had won their previous game by 9 points and lost the game prior to that by 15 points, then their Ave_Res_Last_2 would be (9-15)/2 = -6/2 = -3)
- Own_MARS is the team's own MARS Rating prior to the game, divided by 1000
- Opp_MARS is the team's opponent's MARS Rating prior to the game, also divided by 1000
- Interstate_Clash is +1 if the team is playing in its home state and its opponent is playing interstate (eg Sydney at home to Melbourne), -1 if the reverse is true (eg Sydney playing at the MCG), and 0 if neither or both teams are playing interstate (eg Hawthorn v Carlton at Aurora)
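Like the Bookie Model, this formula drops straight into code. Here's a sketch, with the ratings and recent results in the example being hypothetical:

```python
def smart_margin(ave_res_last_2, own_mars, opp_mars, interstate_clash):
    """Smart Model: predicted victory margin from recent results, MARS
    Ratings (already divided by 1000) and the interstate indicator."""
    return (0.091119 * ave_res_last_2
            + 672.347 * own_mars
            - 672.483 * opp_mars
            + 13.9791 * interstate_clash)

# A hypothetical 1020-rated home team hosting a 990-rated interstate
# opponent, having won its last game by 9 and lost the one before by 15:
print(round(smart_margin(-3, 1.020, 0.990, +1), 1))
```

Notice that the two MARS coefficients are almost equal and opposite, so the model is, in effect, pricing the Rating difference between the teams, with the recent-results and interstate terms as adjustments.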
This model, which I'll call the Smart Model, has a Mean APE of 29.71, about 1% lower than the Bookie Model's, a Median APE of 24.99, also about 1% lower, and an R-squared of 26.4%, some 1.4 percentage points better than the Bookie Model's.
Now, to be fair, I didn't collect the bookie probability data for seasons 1999 to 2005 myself, so I can't vouch for its quality. But, given the data I have, it does appear that, at least in terms of predicting victory margins, we can do at least as well as the bookies using nothing more than previous game results and knowledge about the game's venue.
I've spoken previously about the fear modellers have that they've produced a model that has overfitted the available data - that is, which has been constructed in such a way that it has become skilled in explaining the data that was used in its creation, to the detriment of its ability to predict results for games it hasn't seen. Think "HELP Model" at this point.
In creating the models used in this blog I've taken two precautions to guard against overfitting. One precaution is built into the Eureqa tool, which allows you to split the data that you use to construct models into that which will be used to build a candidate model and that which will be used to evaluate it and eliminate it from consideration if it performs badly on the unseen data.
The other precaution is that I completely excluded data from the 2010 season from what I loaded into Eureqa, so this data was neither used for model construction nor for its internal validation by Eureqa. The idea is that if the performance of the Smart Model relative to the Bookie Model is similar for this 2010 data then we can feel more confident that the strong performance we witnessed by the Smart Model on data from the 1999-2009 timeframe is not merely a result of overfitting.
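The holdout idea above can be sketched in a few lines: choose and fit your models on the 1999-2009 data, then score each on games from 2010 that neither you nor the tool has seen. The data below is invented purely for illustration:

```python
import statistics

def score(model, games):
    """Mean and Median APE of a margin model over (inputs, actual) pairs."""
    errors = [abs(model(x) - actual) for x, actual in games]
    return statistics.mean(errors), statistics.median(errors)

def toy_model(x):
    """Stand-in model whose input is already a margin prediction."""
    return x

# Hypothetical holdout set of (model_input, actual_margin) pairs that
# played no part in model construction or selection:
holdout_2010 = [(12.0, 20), (-5.0, -1), (30.0, 18), (2.0, -10)]
mean_err, median_err = score(toy_model, holdout_2010)
print(mean_err, median_err)
```

The comparison that matters is whether a model's holdout scores look broadly like its in-sample scores; a large gap is the classic signature of overfitting.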
For the 2010 data up to and including Round 12, the comparison is as follows:
- Mean APE: Bookie Model 28.63; Smart Model 29.11 (about 1.7% worse)
- Median APE: Bookie Model 25.46; Smart Model 23.64 (about 7.1% better)
- R-squared: Bookie Model 33.9%; Smart Model 32.0% (about 1.9 percentage points worse)
That's not a slam-dunk for the Smart Model's superiority over the Bookie Model, but it certainly doesn't suggest any chronic overfitting either.
Conclusion: In terms of predicting victory margins there's not a lot more information, if any, in a bookmaker's prices than that which resides in the knowledge of recent game results and knowledge about the game venue. That means we have hope ...
More in future blogs.