In the previous blog I used a clustering algorithm - Partitioning Around Medoids (PAM) as it happens - to group games that were similar in terms of pre-game TAB Bookmaker odds, the teams' MARS Ratings, and whether or not the game was an Interstate clash. There it turned out that, even though I'd clustered using only pre-game data, the resulting clusters were highly differentiated with respect to the line betting success rates of the Home teams in each cluster.
I commented in that post that:
It's worth pointing out that the differences we see in the line-betting performance of the Home team across the clusters were not an inevitable consequence of the clustering process since the Home team's line-betting performance was not used as a direct input to the clustering algorithm. (Actually it would be interesting to go back and include it to see what it changed.) Because of this we can feel a little more confident that the differences we see here might persist into the future.
That bracketed thought stuck with me for a couple of days, which was just long enough for me to decide, firstly, that I had to know what would change if I redid the clustering and, secondly, that it might make for a more interesting blog if I treated the whole process as an exercise in building a predictive model that could actually be used as a Fund algorithm if it looked promising. (I should note that, for this to be a sensible course of action, you need to convince yourself that the TAB Bookmaker might be systematically biased in his handicapping and that those biases are related to the Home/Away status of the teams, to whether or not the clash is an Interstate clash, and/or to the MARS Ratings of the teams, as these are the only variables, other than the line betting outcome for the Home team, that are part of the clustering process.)
What I won't be doing in this blog is re-explaining in any detail the process of clustering the games or of building the rules that can be used to predict cluster membership. For that information, I'd refer you to the previous blog. What I will do here though is explain the model-building and model-testing processes step-by-step.
STEP 1: CREATE THE CLUSTERS ON A HOLDOUT SAMPLE
If you want to build a model to use in the wild, and you want to stop yourself from believing you've created something with preternatural predictive powers only to find that those powers dissipate in the presence of an unseen future, you need to reduce the likelihood that your model overfits the data. One of the best ways I know of to do this is by creating a holdout sample that you assiduously withhold from the model-building and testing process until you can hand-on-heart vow to make no more changes to the predictive model you've lovingly crafted.
The moment you fall into the trap of testing your model on the holdout sample and then going back and retweaking your predictive model is the moment that your holdout sample can only rightly be thought of as part of the training sample and therefore useless as a means of testing your model's true predictive ability.
For the purposes of creating the clusters for this blog I randomly selected 500 games to be used as the training set and then locked away the remaining 644 games from the 2007 to 2012 seasons to act as the holdout sample.
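For concreteness, here's a minimal sketch of that kind of split in Python (the original work was done in R; the function name, the seed, and the use of game indices are my own illustrative choices):

```python
import random

def split_holdout(games, train_size, seed=0):
    """Randomly partition a list of games into a training set and a
    holdout set that stays locked away until final testing."""
    rng = random.Random(seed)
    shuffled = games[:]            # copy so the original order is untouched
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# 1,144 games from the 2007 to 2012 seasons: 500 for training, 644 held out
games = list(range(1144))
train, holdout = split_holdout(games, 500)
```

The key discipline is in the process, not the code: the holdout list is created once and then never consulted until Step 4.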
Clustering those 500 games using the same clustering variables as in the previous blog but now also including the line betting outcome for the Home team as an input variable, PAM suggested that an 8-cluster solution looked about right. Here's what it looked like (as usual, the image below can be clicked on to view a larger version):
(For now, ignore the bottom-most rows where the row labels are red - I'll come back to those.)
The main thing to notice in this table for now is that the clustering has done an outstanding job of building groups of games with homogeneous or near-homogeneous line betting outcomes for the Home team. Five of the segments are completely pure (ie contain only Home team line betting winners or only Home team line betting losers), another is virtually so, and the remaining two clusters include games where the Home team was triumphant on line betting at least three-quarters of the time. The clusters are also quite well-differentiated on the other variables used for clustering - which we're implicitly assuming are the correlates of the TAB Bookmaker's bias, as I noted parenthetically earlier in this post.
Were we to recreate this line betting predictive performance in real life, well, frankly, we'd be rich.
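For readers curious about the mechanics of PAM itself, here's a toy Python version of the medoid-swap idea (the original clustering used R's PAM implementation; the function name, the greedy swap scheme, and the one-dimensional toy data are all mine, purely for illustration):

```python
import itertools
import random

def pam(points, k, dist, seed=0):
    """Tiny Partitioning Around Medoids: pick random initial medoids,
    then greedily swap a medoid for a non-medoid while the total
    within-cluster distance keeps improving."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)

    def cost(meds):
        # each point contributes its distance to the nearest medoid
        return sum(min(dist(points[i], points[m]) for m in meds)
                   for i in range(len(points)))

    improved = True
    while improved:
        improved = False
        for m_idx, p in itertools.product(range(k), range(len(points))):
            if p in medoids:
                continue
            trial = medoids[:]
            trial[m_idx] = p
            if cost(trial) < cost(medoids):
                medoids, improved = trial, True
    # assign each point to its nearest medoid
    labels = [min(range(k), key=lambda c: dist(points[i], points[medoids[c]]))
              for i in range(len(points))]
    return medoids, labels

# two obvious groups on a number line separate cleanly
points = [0, 1, 2, 10, 11, 12]
meds, labels = pam(points, 2, lambda a, b: abs(a - b))
```

Unlike k-means, the cluster centres here are always actual games, which is part of what makes PAM well-suited to mixed numeric and categorical inputs like Interstate status and MARS Ratings.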
To see how likely that is we need to apply this clustering framework to the holdout sample, and to do that we need to generate a ruleset that we can apply to game data to allocate each game to a cluster.
STEP 2: CREATE A CLUSTER-PREDICTING RULESET
Any number of ruleset-producing algorithms might be used for this step, but I chose a fairly simple tree-based one called PART, which is available in R's RWeka package. Experience has taught me that, left untuned and allowed to grow trees to their fullest extent, PART will often overfit as it moulds and shapes itself to every nuance of the sample dataset - nuances that turn out to belong only to the sample dataset. A simple remedy, I've found, is to use the tuning parameter that controls the minimum size of a node in the tree that PART constructs - in simple terms, to control the minimum number of games to which any identified rule pertains.
(Note that I didn't impose this constraint on the rules I generated in the previous blog because my goal there was to describe rather than to predict.)
The strategy of restricting the minimum node size produces fewer rules, which tend to generalise better but to fit the sample data less well than an unconstrained PART tree would. The trick is to choose the minimum node size so as not to compromise the fit too dramatically but, simultaneously, not to allow too many rules to be created. For the current problem I set the minimum node size to 25, which resulted in just 8 rules that, combined, assign games to their correct cluster about two-thirds of the time:
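PART itself lives in RWeka, but the effect of a minimum node size is easy to see with a toy tree-grower in Python (everything below - the one-feature splitter, the names, the data - is my own sketch of the general idea, not PART's actual algorithm):

```python
from collections import Counter

def majority(rows):
    """Most common label among (feature, label) rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def grow(rows, min_node):
    """Grow a one-feature binary tree, refusing any split that would
    leave a child with fewer than min_node rows."""
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        if len(left) < min_node or len(right) < min_node:
            continue                       # split would make a node too small
        # misclassification count if each side predicts its majority label
        err = (sum(1 for _, y in left if y != majority(left)) +
               sum(1 for _, y in right if y != majority(right)))
        if best is None or err < best[0]:
            best = (err, t, left, right)
    if best is None:                       # no legal split: make a leaf
        return ('leaf', majority(rows))
    _, t, left, right = best
    return ('split', t, grow(left, min_node), grow(right, min_node))

def predict(tree, x):
    while tree[0] == 'split':
        tree = tree[2] if x <= tree[1] else tree[3]
    return tree[1]

# 30 'A' games then 31 'B' games along one feature
rows = [(i, 'A') for i in range(30)] + [(30 + i, 'B') for i in range(31)]
tree = grow(rows, min_node=25)
```

Raise `min_node` far enough and the tree collapses to a single leaf - the same trade-off, in miniature, as constraining PART: fewer, coarser rules in exchange for less sensitivity to sample-specific noise.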
The input variables I provided to PART were the same variables that I used for the clustering, with the target variable the cluster number to which each game had been assigned by PAM.
In the above, Rule 1 for example suggests that we should allocate a game to cluster 1 if it is an Interstate clash (ie Interstate is greater than 0, which means it must be +1, denoting an Interstate clash) where the Home team's MARS Rating is more than 1.63% higher than the Away team's. That rule applies to 74 games and is wrong on only 10 occasions.
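Expressed as code, Rule 1 amounts to something like this (a Python paraphrase of the rule as described above; the function name is mine, while the +1 Interstate coding and the 1.63% threshold are from the rule itself):

```python
def rule_1(interstate, mars_diff_pct):
    """Rule 1: assign a game to cluster 1 when it's an Interstate clash
    (Interstate = +1) and the Home team's MARS Rating exceeds the Away
    team's by more than 1.63%."""
    return interstate > 0 and mars_diff_pct > 1.63
```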
STEP 3: APPLY THE RULES TO THE TRAINING SAMPLE
Next we need to determine what our wagering strategy is going to be for games in each cluster. For that we need firstly to determine how well these rules work on our training sample of 500 games. This information is provided in the first two rows with red row labels in the first table from this blog.
This shows, for example, that the Home team won on line betting in 66.2% of the games that the ruleset assigned to cluster 1. So, when we apply the rules to the holdout sample we'll be selecting a Home team win on line betting whenever the game in question is assigned to cluster 1.
We can do the same thing for the remaining 7 clusters, noting that our decision will be "no bet" if the Home team winning rate for a particular rule lies between 47.4% and 52.6%, because such a rate would not be sufficient to cover the overround if we're wagering at $1.90.
We wind up determining to wager on the Home team whenever the ruleset assigns a game to any of clusters 1, 6, 7 or 8; to wager on the Away team whenever the ruleset assigns a game to clusters 2 or 4; and to refrain from wagering whenever the ruleset assigns a game to cluster 3. Note that we need not worry about what to do if the ruleset assigns a game to cluster 5 because our ruleset never does this.
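The wagering logic, then, reduces to a threshold test around the break-even win rate at $1.90, which is 1/1.90, or about 52.6% (a Python sketch; the function and action names are mine):

```python
def decide(home_win_rate, price=1.90):
    """Map a cluster's training-sample Home win rate to a wagering action.
    The break-even rate at $1.90 is 1/1.90 (about 0.526); its mirror
    image (about 0.474) is the threshold for backing the Away team."""
    breakeven = 1 / price
    if home_win_rate > breakeven:
        return 'bet home'
    if home_win_rate < 1 - breakeven:
        return 'bet away'
    return 'no bet'
```

So cluster 1's 66.2% Home win rate clears the 52.6% bar comfortably, which is why it lands in the "bet home" group.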
STEP 4: APPLY THE RULES TO THE HOLDOUT SAMPLE
Only now do we allow ourselves to expose our final model to the holdout sample. The results of this reality-test appear as the remaining rows in the earlier table.
What we find is that the ruleset performs at better-than-chance levels for games that it assigns to clusters 1, 4, 6 and 8, but worse-than-chance for games that it assigns to clusters 2 or 7. Overall, its success rate is 52.7%, which is just barely enough for it to be profitable in the real world wagering at $1.90.
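To see just how thin that margin is: at level stakes the expected return per unit staked is the win rate times the price, minus one (a quick check in Python; `roi` is my own name for it):

```python
def roi(win_rate, price=1.90):
    """Expected return per unit staked at level stakes."""
    return win_rate * price - 1

# at the 52.7% holdout success rate the edge is wafer-thin:
# 0.527 * 1.90 - 1 is roughly 0.0013, i.e. about 0.13% per unit staked
edge = roi(0.527)
```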
Your instinct might be to tweak the model for use in the real world to wager only when the ruleset assigns a game to clusters 1, 4, 6 or 8 - but if you follow that course of action you've just turned the holdout sample into a training sample and run the risk that you've opened the door to overfitting just as it was slinking down the driveway looking for a less savvy modeller.
We're left then with a model that looks promising if not outstanding. We'd probably want to make dummy wagers with it for a season or two before we did anything rash.
INVOKING THE HINDSIGHT-BIAS
In hindsight we might have had our suspicions about games assigned to cluster 2 had we looked closely at the "confusion matrix" for the ruleset on the training sample. This matrix is a cross-tabulation of each game's actual cluster against the ruleset's predicted cluster for that game. For the current ruleset it looked like this:
Notice that the accuracy of the ruleset for those games it assigned to cluster 2 is just 56%, which is the second-lowest accuracy rate for any of the assigned clusters, better only than the accuracy rate for games assigned to cluster 3, which is a rule that we're ignoring for wagering purposes anyway.
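In code, extracting those assigned-cluster accuracy rates from a confusion matrix looks something like this (a Python sketch; the counts below are invented for illustration, chosen only so that the assigned-cluster-2 rate comes out at the 56% quoted above):

```python
def predicted_accuracy(confusion):
    """Per-predicted-cluster accuracy from a confusion matrix given as
    {actual_cluster: {predicted_cluster: count}}: of the games assigned
    to cluster c, what share truly belonged to c?"""
    totals, correct = {}, {}
    for actual, row in confusion.items():
        for predicted, n in row.items():
            totals[predicted] = totals.get(predicted, 0) + n
            if predicted == actual:
                correct[predicted] = correct.get(predicted, 0) + n
    return {c: correct.get(c, 0) / t for c, t in totals.items()}

# toy matrix, NOT the blog's actual figures
cm = {1: {1: 64, 2: 6},
      2: {1: 10, 2: 14, 3: 1},
      3: {2: 5, 3: 9}}
acc = predicted_accuracy(cm)
```

Note that this reads the matrix column-wise (by predicted cluster), which is the view that matters for wagering: we act on the cluster the ruleset assigns, not the one the game truly belongs to.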
This same strategy, however, would not have warned us about the perils of wagering on games that the ruleset assigns to cluster 7. If anything, we'd have been encouraged by the ruleset's 80% accuracy in such games. We'd also have been unnecessarily nervous about games assigned to clusters 4 and 6.