On Friday night, while watching the progress of the Saints v Freo game knowing that Investors had a SuperMargin wager on the Saints to win by 20-29 points, I was wondering how to react to the changes in the scoreline as the game progressed. Should I want the Saints to lead early? By a little? By a lot? By about 5 points at Quarter Time and 10 points at Half Time?
So I set myself the statistical challenge of constructing models that could answer the following question: what is the probability that the final margin will fall in SuperMargin bucket X given that the home team margin at quarter time, half time, or three-quarter time is Y?
There are a number of possible modelling approaches to follow here, and I decided to adopt two of them:
- a proportional odds logistic regression (using the polr function in the MASS package for R)
- a multinomial logit (using the vglm function in the VGAM package in R)
I fitted three models of each type, one using the home team margin at quarter time (and the square of that margin), another using the half-time home team margin (and its square), and a third using the three-quarter time margin (and its square). In all cases the target variable was the SuperMargin bucket of the final result.
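To make the target variable concrete, here is a minimal Python sketch of how a final home team margin might be mapped to one of the 23 SuperMargin buckets. The function names and bucket labels are my own illustrative choices, and the bucket boundaries (1-9, 10-19, and so on up to 100+, plus the draw) are assumed from the buckets mentioned in this post.

```python
def supermargin_bucket(home_margin: int) -> str:
    """Map a final home-team margin (in points) to a SuperMargin bucket label.

    Labels are illustrative; boundaries are assumed to be 1-9, 10-19, ...,
    90-99 and 100+ for each team, plus the draw.
    """
    if home_margin == 0:
        return "Draw"
    side = "Home" if home_margin > 0 else "Away"
    size = abs(home_margin)
    if size >= 100:
        return f"{side} 100+"
    if size <= 9:
        return f"{side} 1-9"
    low = (size // 10) * 10  # lower edge of the 10-point bucket
    return f"{side} {low}-{low + 9}"


def all_buckets() -> list[str]:
    """List the 23 buckets, ordered from biggest Away win to biggest Home win."""
    ranges = ["1-9"] + [f"{lo}-{lo + 9}" for lo in range(10, 100, 10)] + ["100+"]
    away = [f"Away {r}" for r in reversed(ranges)]
    home = [f"Home {r}" for r in ranges]
    return away + ["Draw"] + home
```

For instance, `supermargin_bucket(25)` returns `"Home 20-29"`, the bucket of the Saints wager mentioned at the start of this post, and `all_buckets()` gives the ordered list that an ordinal model would treat as its response levels.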
In choosing a time period for the data with which to fit these models, I had to trade off the desire to use games from more recent seasons, which are more likely to exhibit margin dynamics similar to today's, against the need to ensure that I had a reasonable sample of games with outcomes in each of the 23 possible buckets. The most problematic buckets are those representing the most extreme outcomes.
Eventually, weighing these two competing desires, I decided to use all games from the start of the 1980 season to the end of the 2011 season. That gave me over 5,000 games in total and at least 30 games in every bucket, as you can see from the table at left.
It's interesting to note that the most common bucket for game outcomes over this period has been a Home win by 10-19 points. This result represents almost 9% of all game outcomes.
Next most common is a Home team win by 1-9 points, which has occurred only 10 fewer times across the entire 5,000+ game sequence.
Thereafter come the results of Home team by 20-29 and by 30-39 points, and only then do we come to the most common result involving an Away team victory: Away team wins by 10-19 points, which comprise a little over 7% of all outcomes.
Away team wins by 1-9 points come next, occurring only 4 fewer times than Away team wins by 10-19 points.
Another feature of this table is that huge Home team wins are far more prevalent than huge Away team wins, with Home team victories by 100 points or more occurring more than twice as often as Away team victories by the same margin.
To the statistical models then.
Proportional Odds Logistic Regression
In this formulation we assume that the final bucket of the result can be derived from the Home team margin and its square at a chosen point in the game, a set of fixed coefficients to be estimated, and an (unknown) error term. (See this Wikipedia entry for details.)
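As a rough illustration of how such a model turns a margin into a row of bucket probabilities, here is a minimal Python sketch. It assumes the usual logistic form, in which the cumulative probability of finishing in bucket k or lower is logistic(threshold_k - eta), with eta a linear function of the margin and its square. The thresholds and coefficients below are hypothetical placeholders, not the fitted values from my models.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def proportional_odds_probs(margin, thresholds, b_margin, b_margin_sq):
    """Bucket probabilities under a proportional odds model.

    P(bucket <= k) = logistic(threshold_k - eta), where
    eta = b_margin * margin + b_margin_sq * margin**2.
    Each bucket's probability is the difference of adjacent cumulative values.
    """
    eta = b_margin * margin + b_margin_sq * margin ** 2
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical values for illustration only; NOT the fitted coefficients.
thresholds = [-4.4 + 0.4 * k for k in range(22)]  # 22 increasing cut-points
probs = proportional_odds_probs(margin=10, thresholds=thresholds,
                                b_margin=0.05, b_margin_sq=0.0)
```

With 22 increasing thresholds the function returns 23 probabilities that sum to 1, one per bucket, which is the structure of each row in the tables that follow.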
Using firstly the Home team margin at quarter time, the fitted model yields the following:
This table provides the probability that the final result is in bucket X given a particular Home team margin at quarter time. The probabilities sum to 1 across each row.
So, for example, if the Home team leads by 10 points at quarter time, the most likely final buckets are Home team by 10-19, Home team by 20-29 and Home team by 30-39 points, each of which has a probability of about 11%.
One of the features of this table to note is how spread out the probability density is across each row. In many cases the highest probability is only about 12 or 13% and there are 5 or more buckets with probabilities near or above 10%. What that tells us is that the quarter time margin is a relatively weak indicator of the final result.
Next the table based on the Home team margin at half time:
We see a little more peaking in the probability density now, with the most likely bucket in each row generally attracting 15% or more of the total probability. A $7 price breaks even at a probability of 1/7, or about 14.3%, so even if you've secured only the rock-bottom $7 price for a bucket with the TAB, if the game score is in the right bucket you now have a wager with a positive expectation.
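The arithmetic behind that positive-expectation claim can be sketched in a few lines of Python. This assumes the quoted price is the total return per $1 staked, including the stake itself; the function names are my own.

```python
def break_even_probability(price: float) -> float:
    """Win probability at which a wager at this price has zero expected profit.

    Assumes `price` is the total return per unit staked (stake included).
    """
    return 1.0 / price


def expected_profit(prob_win: float, price: float, stake: float = 1.0) -> float:
    """Expected profit of a wager: win (price - 1) * stake or lose the stake."""
    return stake * (prob_win * price - 1.0)
```

At a price of $7, `break_even_probability(7.0)` is about 14.3%, so a bucket carrying a 15% probability gives `expected_profit(0.15, 7.0)` of roughly 5 cents per dollar wagered, while one carrying only 12% has a negative expectation.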
Finally, the table for the model based on the Home team margin at three-quarter time:
Now, the buckets with the highest probabilities generally hold 20% or more of the total density. If you've a wager priced at $7 you can even afford to be out by a bucket either way at this point of the game and still know that you're holding a bet with a positive expectation.
This model also suggests that a wager on the draw has a positive expectation at three-quarter time provided that the Home team margin is between about -10 and +5.
Multinomial Logistic Regression
The proportional odds formulation imposes some constraints on the fitted model in that only a single set of coefficients is estimated: 24 in total, comprising 22 threshold coefficients defining the cutoffs between the 23 buckets, and 1 each for the Home team margin and its square. In doing this the formulation recognises the natural ordering of the buckets in that, for example, the threshold defining the break between an Away team win by 20-29 points and an Away team win by 10-19 points is forced to be numerically less than the threshold defining the break between an Away team win by 10-19 points and an Away team win by 1-9 points.
In contrast, the multinomial formulation assumes no such ordering of the buckets, treating each bucket as an entirely separate entity (for details, see this Wikipedia entry).
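To show what treating each bucket separately means in practice, here is a minimal Python sketch of a baseline-category multinomial logit. Each non-reference bucket gets its own intercept and its own coefficients on the margin and its square, so with 23 buckets there are 22 x 3 = 66 coefficients rather than the 24 of the proportional odds formulation. The coefficient values below are hypothetical placeholders, and putting the reference bucket last is an assumption of this sketch.

```python
import math


def multinomial_probs(margin, coefs):
    """Bucket probabilities under a baseline-category multinomial logit.

    `coefs` holds one (intercept, b_margin, b_margin_sq) tuple per
    non-reference bucket; the reference bucket (placed last here, by
    assumption) has its linear predictor fixed at 0.
    """
    etas = [a + b * margin + c * margin ** 2 for a, b, c in coefs] + [0.0]
    top = max(etas)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values for illustration only; NOT the fitted coefficients.
coefs = [(0.1 * k, 0.01 * (k - 11), 0.0) for k in range(22)]
probs = multinomial_probs(10, coefs)
n_params = 3 * len(coefs)  # 22 buckets x 3 coefficients each
```

Because nothing ties one bucket's coefficients to its neighbours', the fitted probabilities need not vary smoothly from one bucket to the next in the way the ordered thresholds enforce.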
This means that the fitted results for the two formulations can be, and are, different. Here, for example, is the table for the model based on the Home team margin at the end of the first quarter:
As you can see, however, the differences are not vast, particularly for margins in the -30 to +30 range.
Here too, as for the model built using the proportional odds formulation, we find that the probability density is smeared generously across each row, meaning that most SuperMargin wagers will not have a positive expectation at this point in the game.
Using the Home team margin at half time, we obtain this:
Again, the results are quite similar to those from the proportional odds formulation, though the peak probabilities in each row tend to be 1 or 2 percentage points lower here, implying that fewer wagers priced at $7 will have a positive expectation at half time if we prefer this formulation to the earlier one.
Finally, the results for the multinomial formulation using the Home team margin at three-quarter time:
Once again there's broad agreement between the results for this formulation and those based on the proportional odds formulation, and once again also we find generally smaller probability peaks with the multinomial formulation.
(So, if your SuperMargin wager is going well, you'd prefer the probabilities from the proportional odds formulation; otherwise you'd prefer those from the multinomial formulation.)
These tables can be used to create some rough rules of thumb for SuperMargin wagering:
- If, at quarter time, the Home team margin is within about a bucket of the bucket on which you've wagered (ignoring the draw), you've about an 8-12% chance of landing your wager
- If the same is true at half time you've about a 10-15% chance
- If, glory of glories, the same is true at three-quarter time, you've about a 15-20% chance (though sometimes a little less if the margin is at the extremes of the adjoining bucket)
Note that in building these models I've not used bookmaker prices. This would be an interesting addition and is something I might do in the future, but it would significantly restrict the time period over which I could fit models given that I only have reliable bookmaker prices from 2006 onwards.