Clustering is a statistical technique for finding natural groupings in a set of data, that is, sets of points that are more similar to one another than they are to other points. To use it you first need to choose a metric that quantifies the "distance" between any two points in your data set, and you also need to choose which elements of your data to feed into that metric.
In 2009 I used clustering to define different types of Grand Finals, and in 2010 I did the same thing for Home-and-Away games. In both cases I used team leads at the end of each quarter as the sole input, which means that games can only be clustered after they're completed rather than before. This renders these clustering typologies more coronial than predictive.
For today's blog I'll be creating a game clustering that uses as input only the information that we might reasonably know pre-game - for example, the pre-game team MARS Ratings, Bookmaker prices (or some metric derived from them), and information about the game venue.
I'll once again be using R's pam function - which has proved as trusty as any I've used - coupled with the daisy function (both from the cluster package) to calculate distances via the "manhattan" metric on standardised data with the following inputs:
- Home team implicit victory probability using the Log Probability Score Optimising (LPSO) approach
- Home team and Away team MARS Ratings
- MARS Ratio (the ratio of the Home team MARS Rating to the Away team MARS Rating)
- The Interstate Status of the game
So, one game will be considered similar to another to the extent that the Home team had a similar pre-game probability, that the Home and Away teams had similar MARS Ratings and a similar ratio of those Ratings, and that the Interstate Status of the two games was the same.
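To make the notion of "distance" concrete, here is a minimal Python sketch (not the R code used for this blog) of a Manhattan distance on standardised inputs. The three games and their values are invented for illustration, and z-score standardisation is an assumption on my part; R's daisy, when asked to standardise, divides by the mean absolute deviation instead. The medoid helper shows the idea behind pam (Partitioning Around Medoids): each cluster is represented by the member with the smallest total distance to the other members.

```python
from statistics import mean, pstdev

# Hypothetical pre-game inputs for three games (values invented for illustration):
# (home win probability, home MARS, away MARS, MARS ratio, interstate status)
games = [
    (0.62, 1010.0, 995.0, 1010.0 / 995.0, 1),
    (0.60, 1008.0, 996.0, 1008.0 / 996.0, 1),
    (0.30, 985.0, 1020.0, 985.0 / 1020.0, 0),
]

def standardise(rows):
    """Z-score each column (an assumed scheme, not necessarily the one used in the blog)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 for a zero-variance column
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def manhattan(a, b):
    """Sum of absolute coordinate differences between two standardised games."""
    return sum(abs(x - y) for x, y in zip(a, b))

def medoid(indices, dist):
    """Index of the member with the smallest total distance to the other members."""
    return min(indices, key=lambda i: sum(dist[i][j] for j in indices))

z = standardise(games)
dist = [[manhattan(a, b) for b in z] for a in z]
centre = medoid([0, 1, 2], dist)
```

In this toy data the first two games sit close together and the third is far from both, which is the sort of structure the clustering picks up on at scale.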
The clustering process produced a 5-cluster solution for the 1,144 games across seasons 2007 to 2012 with the profile below. (The dimensions in blue are those that were used to create the clusters, while those in black were used only to profile them.)
The naming of the clusters - always a fraught step - is somewhat subjective though in this case draws strength from the distinctiveness of many clusters on a few key dimensions. For example, four of the clusters are either all-Interstate or all non-Interstate affairs, and Home team favouritism (or not) strongly defines four of the five clusters too.
To be specific:
- The first cluster, which I've named "Non-Interstate, Home Team Underdogs", contains no Interstate games (that is, games where the Home team is playing in its home state and its opponents are not), and only 11% of its games have a Home team favourite. The victory margin of these games appears to be the most difficult to predict of all the clusters, with the Mean APE of games in this cluster (using the LPS Optimised Margin Predictor) at 31.5 points per game.
- The second cluster, which I've named "Interstate, Home Team Heavy Favourite", contains only Home team Interstate favourites. The Mean APE for these games is the second-highest of all the clusters at 29.9 points per game, suggesting that the victory margin of these games is also relatively hard to predict. Home teams in these games have, historically, been compelled to give too little start by the TAB Bookmaker, as a consequence of which they've won 55.8% of all line contests.
- The third cluster, which I've named "Interstate, Home Team Narrow Favourite", contains only Interstate games. In about two-thirds of them the Home team is also the favourite, though the average estimated Home team victory probability is only 56%, and the Home team carries a higher MARS Rating only 25% of the time. Margins for these games are the most easily predicted, with the Mean APE the lowest of all the clusters at just 27.0 points per game.
- The fourth cluster, which I've named "Mostly Interstate, Away Team Heavy Favourite", is three-quarters Interstate games and contains only games where the Away team is the favourite. The victory margins for these games are relatively easily predicted, with the Mean APE just 28.4 points per game, but this apparent predictability has not helped the TAB Bookmaker to set effective handicaps, as the Away team has failed to cover the spread 59.1% of the time. These games therefore represent the plumpest line-betting opportunity but, unfortunately, make up just 14% of the sample.
- The fifth and final cluster, which I've named "Non-Interstate, Home Team Favourite", contains no Interstate games, and the Home team is the favourite in 99% of its games. Home teams in these contests also do well on line betting, winning some 55.6% of the time.
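For anyone wanting to reproduce this kind of profile, the two summary figures quoted for each cluster can be sketched in Python as below. I'm taking Mean APE to be the mean absolute prediction error in points, which the "points per game" units suggest, and I'm assuming a handicap convention in which the Home team wins on line betting when its actual margin plus its handicap is positive. The records are invented; none of them come from the 1,144-game sample.

```python
from collections import defaultdict

# Hypothetical records (invented for illustration):
# (cluster, predicted home margin, actual home margin, home handicap)
records = [
    (1, -10.0, 25.0, 12.5),
    (1, -5.0, -30.0, 6.5),
    (4, -20.0, -8.0, 18.5),
    (4, -15.0, -40.0, 16.5),
]

def profile(rows):
    """Return {cluster: (mean absolute prediction error, home line-betting win rate)}."""
    by_cluster = defaultdict(list)
    for cluster, predicted, actual, handicap in rows:
        abs_error = abs(actual - predicted)
        home_covers = actual + handicap > 0  # assumed convention for a line win
        by_cluster[cluster].append((abs_error, home_covers))
    summary = {}
    for cluster, results in by_cluster.items():
        n = len(results)
        mean_ape = sum(err for err, _ in results) / n
        win_rate = sum(1 for _, won in results if won) / n
        summary[cluster] = (mean_ape, win_rate)
    return summary

cluster_profile = profile(records)
```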
It's worth pointing out that the differences we see in the line-betting performance of the Home team across the clusters were not an inevitable consequence of the clustering process since the Home team's line-betting performance was not used as a direct input to the clustering algorithm. (Actually it would be interesting to go back and include it to see what it changed.) Because of this we can feel a little more confident that the differences we see here might persist into the future.
The five clusters are broadly similar in size, with the smallest including 14% of games and the largest 25%.
The six seasons included in the analysis have quite distinct cluster profiles:
The most-recent season, 2012, is notable for the over-representation of games classified as "Mostly Interstate, Away Team Heavy Favourite", and under-representation of games classified as "Interstate, Home Team Narrow Favourite" and "Non-Interstate, Home Team Favourite".
Looking at the cluster profile based on a simple grouping of rounds also shows significant differences.
Games classified as "Non-Interstate, Home Team Favourite" predominate in the Finals, while those classified as "Mostly Interstate, Away Team Heavy Favourite" skew towards the latter portions of seasons generally. Further, games from the "Interstate, Home Team Narrow Favourite" cluster crop up more often in the first half of the season and in the Finals, while games classed as "Interstate, Home Team Heavy Favourite" are slightly more prevalent in the second half of the home-and-away season only. Games classified as "Non-Interstate, Home Team Underdogs" are far more likely to be witnessed during about the first three-quarters of the home-and-away season.
PREDICTING CLUSTER MEMBERSHIP
So far all we have is an interesting basis on which to classify games historically. To make the clustering framework useful as a means of classifying prospective games we need to create rules that allow us to assign a game to a cluster on the basis of the relevant input data for that game. For this purpose I availed myself of the services of the PART function in the RWeka package, which allowed me to create just 18 rules that described the 5 clusters with over 99% accuracy.
These rules must be interpreted sequentially - so, for example, rule 3 can only be applied once the games meeting the criteria in rules 1 and 2 have been excluded.
In these rules:
- Interstate_Status is defined as the usual +1/0/-1 variable, where +1 is applied to a game where the Home team is playing in its home State and the Away team is not, 0 is applied where neither or both the Home and Away teams are playing in their home State, and -1 is applied where the Away team is playing in its home State and the Home team is not
- MARS_Ratio is the Home team MARS Rating divided by the Away team MARS Rating
- LPSO_Prob_Home is the Home team's victory probability as estimated by the Log Probability Score Optimised Implicit Probability Predictor, which is (1 / Home team head-to-head price) - 1.0281%, i.e. the reciprocal of the price less 1.0281 percentage points
- Own_MARS_Rating is the Home team's MARS Rating
- Opponent_MARS_Rating is the Away team's MARS Rating
- 1 represents the first cluster in the earlier table (i.e. "Non-Interstate, Home Team Underdogs"), and 2 through 5 represent the remaining clusters, in order, from left to right
So, for example, the first rule says that cluster 5 ("Non-Interstate, Home Team Favourite") should be predicted when Interstate Status is 0 or -1, and the MARS Rating of the Home team is 1.005589 or more times the MARS Rating of the Away team, and the LPSO estimated Home team victory probability is 47.7524% or greater. This rule applies to 224 games from the sample and correctly classifies all 224 of them.
The 18 rules shown here mis-classify only 5 of the 1,144 sample games.
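The sequential nature of the rules can be sketched in Python as a simple decision list. This is an illustration only, not the RWeka output: the sketch encodes just the first rule, because that is the only one spelled out in the text, and it returns None where one of the other 17 rules would be needed. The function names, the dictionary layout and the example game are all hypothetical, and the use of >= follows the "or more" and "or greater" wording above.

```python
def lpso_prob_home(home_price):
    """Reciprocal of the Home head-to-head price less 1.0281 percentage points."""
    return 1.0 / home_price - 0.010281

def make_game(home_price, home_mars, away_mars, interstate_status):
    """Assemble the rule inputs for one game (hypothetical helper)."""
    return {
        "Interstate_Status": interstate_status,  # +1, 0 or -1
        "MARS_Ratio": home_mars / away_mars,
        "LPSO_Prob_Home": lpso_prob_home(home_price),
        "Own_MARS_Rating": home_mars,
        "Opponent_MARS_Rating": away_mars,
    }

def rule_1(g):
    """First rule as described in the text; it predicts cluster 5."""
    return (g["Interstate_Status"] <= 0
            and g["MARS_Ratio"] >= 1.005589
            and g["LPSO_Prob_Home"] >= 0.477524)

# Ordered (condition, cluster) pairs. Only rule 1 is known from the text,
# so the remaining 17 rules are deliberately left out rather than guessed.
RULES = [(rule_1, 5)]

def predict_cluster(g):
    """Apply the rules in order and return the cluster of the first that fires."""
    for condition, cluster in RULES:
        if condition(g):
            return cluster
    return None  # would fall through to rules 2 to 18, which are not encoded here

# Invented example: a $1.60 Home team rated 1,015 hosting a team rated 1,000
# with neither team enjoying an Interstate advantage.
example = make_game(1.60, 1015.0, 1000.0, 0)
predicted = predict_cluster(example)
```

Because the first matching rule wins, the order of the list matters: a later rule is only ever tested on games that every earlier rule has passed over.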
I'll finish by pointing out that, even though the Home team line betting performances are starkly different from 50% for three of the five clusters, wagering exclusively at $1.90 on games from any of these clusters would not have been profitable in every season from 2007 to 2012.
That said, a profit could have been made wagering on the Home team in line markets in five of the six seasons - including 2012 - for games from the second and the fourth clusters.
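As a quick check on why win rates in the mid-50s matter at a $1.90 price, here is the break-even arithmetic as a short Python sketch. The win rates are the whole-sample Home team line-betting figures quoted earlier for the second, fourth and fifth clusters (for the fourth I'm treating the Away team failing to cover as a Home team win, which ignores any pushes). Being six-season averages, they say nothing about the results in any single season.

```python
PRICE = 1.90  # line-betting price quoted in the text

def break_even_rate(price):
    """Win rate at which a level-stake bettor neither wins nor loses."""
    return 1.0 / price

def return_per_unit(win_rate, price=PRICE):
    """Expected profit per unit staked at the given win rate and price."""
    return win_rate * price - 1.0

# Whole-sample Home team line-betting win rates quoted for clusters 2, 4 and 5
home_line_win_rates = {2: 0.558, 4: 0.591, 5: 0.556}

returns = {c: return_per_unit(p) for c, p in home_line_win_rates.items()}
threshold = break_even_rate(PRICE)
```

The break-even rate at $1.90 is about 52.6%, so all three clusters clear it on average, by roughly 6 cents, 12 cents and 6 cents per dollar staked respectively.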