Ensemble Encore

The idea of ensemble learning and prediction intrigues me, which, I suppose, is why I've written about it so often here on MoS, for example here in introducing the Really Simple Margin Predictors, here in a more theoretical context, and, much earlier, here about creating an ensemble from different Head-to-Head predictors. The basic concept, which is that a combination of forecasters can outperform any single one of them, seems at once plausible and remarkable. By taking nothing more than what we already have - a set of forecasts - we're somehow able to conjure empirical evidence for the cliché that "none of us is better than all of us" (at least some of the time).

My interest in the topic is broadly shared. Surowiecki's book, The Wisdom of Crowds, popularised the ensemble concept for a non-technical audience (though there have been those who disagree with its main premise), and those of a more technical bent in the machine learning community continue to actively research and discuss ensemble prediction and predictors.

To be clear though, wisdom does not inevitably emerge when opinions are combined. If I asked a dozen pre-schoolers and a PhD astrophysicist to estimate the age of the universe, it's unlikely that the 13 of them together would do better than the doctor alone. Deciding what characteristics of an ensemble make it likely to be more accurate than its constituent parts, and how to optimally and transparently combine the underlying forecasts, is one area of particularly fervent activity amongst practitioners and researchers.

So, when I read about a new R package, caretEnsemble, which greatly simplifies the task of building ensembles whose base learners can be any of the modelling techniques supported by the caret package - random forests, support vector machines and ordinary least squares regression among them - you'll understand why I was keen to try it out.


Before we build an ensemble or 20, we first need to select a forecasting challenge for it to face. Many MoS posts have been written about predicting game margins, and in the course of writing them I've winnowed down a set of potential regressors and developed an understanding of what average and exceptional performance looks like. Predicting game margins, then, seems like a suitable challenge for the ensemble approach, and it's the one I've adopted for today's blog. (I plan to use caretEnsemble for other modelling tasks and write about the experience in future posts.)

I used data from all games played during seasons 2006 to 2014 inclusive and defined as my target variable the game margin from the home team's point of view - that is, the home team's score less the away team's. The regressors used were:

  • Bookmaker Implied Home team victory probability (using the Overround-Equalising approach)
  • The Bookmaker's Home team and Away team head-to-head prices
  • The handicap imposed by the Bookmaker on the home team in the line market
  • The two teams' MARS Ratings
  • The two teams' Venue Experience
  • The Interstate Status of the game
  • The two teams' recent form, measured by the change in their MARS Ratings across the two most-recent games

(For details about these regressors, please refer to the MoS Primer.)

For each modelling run the data was split into a (different) training sample of 1,000 games, leaving 743 games as a holdout for assessing performance. All variables were scaled and centred as a pre-processing step.
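The modelling itself was done in R, but the split-and-preprocess step is easy to sketch in any language. Here's a minimal Python illustration with made-up data standing in for the real regressors; it follows the common convention of centring and scaling using only the training sample's means and standard deviations, so that nothing about the holdout leaks into the preprocessing:

```python
import random
import statistics

def centre_and_scale(train_rows, test_rows):
    """Scale each column using the training sample's mean and standard
    deviation only, then apply the same transformation to the holdout."""
    n_cols = len(train_rows[0])
    means = [statistics.mean(r[j] for r in train_rows) for j in range(n_cols)]
    sds = [statistics.stdev(r[j] for r in train_rows) for j in range(n_cols)]
    def scale(rows):
        return [[(r[j] - means[j]) / sds[j] for j in range(n_cols)] for r in rows]
    return scale(train_rows), scale(test_rows)

# Hypothetical data: 1,743 games, each with 3 made-up regressor values
rng = random.Random(1)
games = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1743)]
rng.shuffle(games)
train, holdout = games[:1000], games[1000:]        # 1,000 train, 743 holdout
train_s, holdout_s = centre_and_scale(train, holdout)
```

After scaling, each training column has mean 0 and standard deviation 1; the holdout columns will be close to, but not exactly, standardised, which is as it should be.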


In the first round of ensemble creation I selected 16 base learners for inclusion, chosen, very broadly, based on the results of this beauty parade of modelling techniques that I conducted in 2013, in which 52 different algorithms were pitted head-to-head. I deliberately chose both exceptional and poor performers from that analysis, on the assumption that the resulting ensembles might benefit from dissenting opinions, and on the basis that an analyst chosen at random would be about as likely to pick any one of the available model types as another. (To use the language of machine learning, I'm including both weak learners and strong learners.)

The 16 base learner algorithms were:

  • Ordinary Least Squares (lm)
  • Random Forest (randomForest)
  • Conditional Inference Forest (cforest)
  • Neural Net (nnet)
  • Blackboost (blackboost)
  • Stochastic Gradient Boosting Model (gbm)
  • Weighted K-Nearest Neighbours (kknn)
  • Multivariate Adaptive Regression Splines (earth)
  • Bagged Multivariate Adaptive Regression Splines (bagEarth)
  • Partial Least Squares (pls)
  • Sparse Partial Least Squares (spls)
  • Projection Pursuit Regression (ppr)
  • Elastic Net (enet)
  • Ridge Regression (ridge)
  • Generalised Additive Model with Splines (gamSpline)
  • Support Vector Machine with Linear Kernel (svmLinear)

For each ensemble-building run, the 16 base learner algorithms were tuned on the 1,000-game training sample across a "sensible" grid for each tuning parameter, using 5 repeats of 5-fold cross-validation with RMSE as the performance metric.
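The caret package handles this tuning loop internally, but the machinery is simple enough to sketch. The toy Python example below (all numbers and the one-parameter "model" invented for illustration) picks the value of a shrinkage-style tuning parameter that minimises RMSE over 5 repeats of 5-fold cross-validation:

```python
import random
import statistics

def rmse(actual, pred):
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def repeated_cv_rmse(xs, ys, lam, repeats=5, folds=5, seed=0):
    """Mean RMSE over `repeats` runs of `folds`-fold cross-validation for a
    toy ridge-shrunk slope model y ~ b*x, where b = S_xy / (S_xx + lam)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        for f in range(folds):
            test = idx[f::folds]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            # Fit on the training folds only
            b = (sum(xs[i] * ys[i] for i in train) /
                 (sum(xs[i] ** 2 for i in train) + lam))
            # Score on the left-out fold
            scores.append(rmse([ys[i] for i in test], [b * xs[i] for i in test]))
    return statistics.mean(scores)

# Made-up data with a true slope of 0.8 plus noise
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(300)]
ys = [0.8 * x + rng.gauss(0, 0.5) for x in xs]

grid = [0.0, 1.0, 10.0, 100.0]    # candidate values of the tuning parameter
best_lam = min(grid, key=lambda lam: repeated_cv_rmse(xs, ys, lam))
```

The same cross-validated scores are what caret uses to choose each algorithm's tuning parameters before the ensemble-building step begins.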

Once the base learners have been tuned for a given run, the caretEnsemble package offers two methods for constructing ensembles:

  • Forming a linear combination of some or all base learners via a greedy forward-selection approach.
  • Stacking their modelled outputs and using them as inputs into a further modelling round. The form of this latter model can be any of the modelling types supported by caret. For example, you might choose to combine the base learner outputs using the random forest algorithm, with the base learner predictions as the regressors and the actual game margins as the target variable.
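The greedy forward-selection idea is in the spirit of Caruana et al.'s ensemble selection: start from nothing and repeatedly add (with replacement) whichever base learner most reduces the ensemble's RMSE, so that each learner's final weight is simply how often it was selected. Here's a minimal Python sketch with invented learner names and numbers - not the package's actual code:

```python
def rmse(actual, pred):
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def greedy_ensemble(preds, actual, steps=50):
    """Greedy forward selection with replacement: at each step, add whichever
    base learner's predictions most reduce the RMSE of the running ensemble
    average. Final weights are selection counts divided by steps."""
    names = list(preds)
    chosen = []
    running = [0.0] * len(actual)
    for step in range(1, steps + 1):
        def score(name):
            cand = [(r * (step - 1) + p) / step
                    for r, p in zip(running, preds[name])]
            return rmse(actual, cand)
        best = min(names, key=score)
        chosen.append(best)
        running = [(r * (step - 1) + p) / step
                   for r, p in zip(running, preds[best])]
    return {n: chosen.count(n) / steps for n in names}, running

# Hypothetical base learner margin predictions for five games
actual = [12.0, -3.0, 25.0, 8.0, -14.0]
preds = {
    "pls":  [10.0, -1.0, 22.0, 9.0, -12.0],   # strong learner
    "kknn": [30.0, 10.0, 0.0, -20.0, 5.0],    # weak learner
    "lm":   [14.0, -5.0, 27.0, 6.0, -16.0],   # strong learner
}
weights, ensemble_pred = greedy_ensemble(preds, actual)
```

Because re-selecting an already-chosen learner is always an option, the ensemble's in-sample RMSE can never end up worse than the best single base learner's, and complementary learners (here, the hypothetical "pls" and "lm", whose errors partly cancel) attract most of the weight.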

I created ensembles using both methods, the first of which I'll call Greedy Ensembles and the second Stacked Ensembles. For the Stacked Ensembles I chose to keep things simple, creating only a linear model of the base learner outputs (ie employing lm) and using cross-validation and RMSE to optimise the weights. Though, as we'll see, these Stacked Ensembles performed relatively poorly, some brief additional analysis suggested that the choice of a linear model for this purpose isn't the constraining factor.
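For concreteness, the linear stacking step amounts to ordinary least squares with the base learners' predictions as the regressors and actual margins as the target. The Python sketch below (invented numbers throughout, and a hand-rolled normal-equations solver rather than anything from the package) shows the idea:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def stack_lm(base_preds, actual):
    """Fit the stacked meta-model actual ~ intercept + base learner
    predictions by OLS, via the normal equations X'X beta = X'y."""
    X = [[1.0] + [p[i] for p in base_preds] for i in range(len(actual))]
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * y for row, y in zip(X, actual)) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical predictions from two base learners for five games, where the
# actual margins happen to be an equal blend of the two sets of predictions
pls_pred = [10.0, -2.0, 30.0, 4.0, 7.0]
lm_pred  = [6.0, 2.0, 20.0, 0.0, -3.0]
actual   = [0.5 * a + 0.5 * b for a, b in zip(pls_pred, lm_pred)]

beta = stack_lm([pls_pred, lm_pred], actual)   # [intercept, w_pls, w_lm]
```

In this contrived example OLS recovers the blend exactly (weights of one-half each and a zero intercept); with real, noisy predictions the fitted weights are what the stack uses to combine the base learners on new games.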

Speaking of results, here, firstly, is a summary of the performance of the 16 base learners and the two ensemble models across 20 repetitions of the entire process, shown in the form of rankings. The ensemble formed using greedy forward-selection (ENS_greedy) ranked about 4th, on average, across the 20 replicates, using RMSE on the testing data as the performance metric. It finished above all 17 other algorithms three times, finished in the top 3 nine times, and finished in the top 10 in every replicate.

The Stacked Ensemble formed by applying OLS to the base learner outputs (ENS_linear) fared much less well, never coming in as best algorithm and only once finishing in the top 3. On 14 occasions it finished outside the top 10.

Partial Least Squares was the standout performer across the 20 replicates, finishing first six times, in the top 3 on 13 occasions, and never missing the top 10. Its average ranking was 2.8, comfortably the lowest of all the algorithms. Its near-cousin, Sparse Partial Least Squares, was next best with an average ranking of 4th, albeit with considerably more variability in its ranking across replicates. Elastic Net performed slightly worse than Sparse Partial Least Squares, though it was the only other algorithm to finish with an average ranking better than 5th.

Some algorithms, most notably blackboost, Weighted k-nearest neighbours, Random Forest, the Stochastic Gradient Boosted Model, and the Neural Net, performed consistently poorly on the RMSE metric in the holdout sample. 

Measuring performance instead by Mean Absolute Error produces a comparable ranking of the various algorithms, though gamspline and Projection Pursuit Regression do a little better, suggesting that they might be relatively more prone to the occasional large error, which worsens their RMSE more than their MAE. Partial Least Squares performs even better relative to the other algorithms on this metric, its average ranking dropping to 2.3, partly on account of its 9 first-place finishes.
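The intuition here is that squaring errors before averaging punishes the occasional blowout far more than taking absolute values does. A quick, made-up illustration:

```python
def rmse(errors):
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

steady = [20, 20, 20, 20, 20]   # consistent 20-point misses
spiky  = [10, 10, 10, 10, 60]   # mostly small misses, one blowout

# Both error profiles have the same MAE of 20 points, but the single
# 60-point miss lifts the spiky forecaster's RMSE from 20 to about 28.3
```

Two forecasters with identical MAEs can therefore have quite different RMSEs, which is exactly the pattern that would see gamspline and Projection Pursuit Regression rank better on MAE than on RMSE.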

We can also assess the algorithms' performances by averaging the actual RMSEs and MAEs on the holdout samples across the 20 replicates, rather than averaging rankings.

On this view, based on RMSE, Elastic Net snatches 2nd from Sparse Partial Least Squares, and the Greedy Ensemble finishes 4th. Partial Least Squares remains ranked 1st, regardless of whether an RMSE or MAE metric is preferred, though the Greedy Ensemble squeezes into 2nd if we use the MAE metric.

What's also impressive about the performance of Partial Least Squares is the relatively small standard deviation of its RMSE and MAE results. In contrast, the relatively high level of variability in the RMSE and MAE results for the Greedy and Linear Ensembles is a surprising and less attractive feature of their performances. Ideally, we want algorithms that do well not just on average, but on every replicate.

If we look at the average composition of the Greedy Ensemble across the 20 replicates - that is, the average weighting that it applied to each base learner in coming up with its opinions - we can see that it recognised the talent of Partial and Sparse Partial Least Squares (which, on some runs, produced the same or very similar outputs) but allowed other, poorer-performing algorithms to have too much influence, and shunned some of the other better-performing algorithms.

On average, Partial and Sparse Partial Least Squares together accounted for a little over 80% of Greedy Ensembles, but Weighted K-Nearest Neighbours, Conditional Inference Forest, blackboost, randomForest and Neural Net accounted for a collective 8%, which, in retrospect, was far too generous a slice. These algorithms, for whatever reason - perhaps overfitting on this particular dataset - proved unable to replicate their strong performances on training sets when applied to the matching test sets. It would have been far better to allocate their collective 8% weighting to, say, OLS, which failed to appear in any of the Greedy Ensembles but turned in relatively strong average holdout performances. The ridge algorithm was also under-weighted relative to its ability.
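To make the cost of that misallocated 8% concrete: an ensemble's forecast is just the weighted sum of its base learners' forecasts, so even a small weight on a learner that is badly wrong for a given game drags the combined margin with it. A hypothetical single-game example (all weights and predictions invented, loosely resembling the composition described above):

```python
# Hypothetical ensemble weights and one game's margin predictions
weights = {"pls": 0.55, "spls": 0.27, "ridge": 0.05, "kknn": 0.03, "other": 0.10}
preds   = {"pls": 14.2, "spls": 13.8, "ridge": 16.0, "kknn": 35.0, "other": 15.0}

# The ensemble's forecast margin is the weight-by-weight combination
ensemble_margin = sum(weights[m] * preds[m] for m in weights)

# kknn's wayward 35-point forecast pulls the combination about 0.6 points
# above where it would sit if kknn's 3% weight were given to pls instead
```

Small weights on erratic learners mostly wash out, but across many games they add noise that a better-allocated ensemble would avoid.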

In summary then, and simplistically, you might conclude that the Greedy Ensemble performed only adequately, failing to outperform all of its constituent base learners, finishing clearly behind Partial Least Squares, and, at best, merely matching Sparse Partial Least Squares and Elastic Net.

But this way of thinking about the results assumes that you might somehow have known a priori that Partial Least Squares was the best and only choice. One advantage of the ensemble approach, however, is that you don't need such omniscience - you need only pick a variety of competent base learners and let the ensemble-building process construct a generally (and maybe always) good and sometimes superior predictor.

Viewed in this way, you might instead summarise the results for these 16-learner ensembles (which I'll call Ensemble 16s) as revealing that the Greedy Ensemble approach provides the best possible solution about 15% of the time and a near-best solution the rest of the time. That makes it seem a lot more attractive.


Still, I thought, perhaps we could do better. Maybe the Greedy Ensembles described above were hamstrung by the quality of the ingredients offered. What if I reduced the number of base learners to six, leaving out the very poorest performing algorithms and including four of the seven best performing algorithms?

For this second run of 20 replicates I did just that, including as base learners only:

  • Weighted K-Nearest Neighbours (kknn)
  • Multivariate Adaptive Regression Splines (earth)
  • Bagged Multivariate Adaptive Regression Splines (bagEarth)
  • Sparse Partial Least Squares (spls)
  • Elastic Net (enet)
  • Generalised Additive Model with Splines (gamSpline)

As we did earlier, let's firstly look at the performances based on model ranking.

On the RMSE metric, Sparse Partial Least Squares is best, but only narrowly better than the Greedy Ensemble, and the Linear Ensemble now performs much better. Elastic Net, though finishing 1st twice as often as the Greedy Ensemble, finishes in the top 3 only 11 times compared to the Greedy Ensemble's 18. So much more stable is the Greedy Ensemble's performance, in fact, that the standard deviation of its ranking across the 20 replicates is almost 30% less than that of Sparse Partial Least Squares.

Turning next to the MAE metric, we find that the Greedy Ensemble is now the best algorithm, with an average ranking of 2.8 and a standard deviation of its rankings below that of every algorithm except the worst performer, Weighted K-Nearest Neighbours. The Greedy Ensemble earns its low average ranking not by being the best algorithm more often (it's best only twice) but by being a good (ie top 3) algorithm often.

Looking instead at the average RMSEs and MAEs across the 20 replicates places the Greedy Ensemble marginally behind Elastic Net and Sparse Partial Least Squares, though the differences are exceedingly small. Practically speaking, the three methods are in a dead-heat for first. Elastic Net and Sparse Partial Least Squares do, however, have smaller standard deviations in their favour.

As (bad) luck would have it, the average holdout sample used for assessing Ensemble 6s was a little easier to predict than the average holdout sample used for assessing Ensemble 16s, since all but one of the average RMSEs and MAEs for the models used in Ensemble 6s were smaller than the average RMSEs and MAEs for the same model types when used in Ensemble 16s. Elastic Net, for example, turned in an average RMSE of 36.48 points in the Ensemble 6 creation process, but an average RMSE of 36.81 points in the Ensemble 16 creation process. On that basis, there's nothing useful to be said about the fact that the Ensemble 6 ENS_greedy and ENS_linear models had smaller average RMSEs and MAEs than the Ensemble 16 equivalents.

In percentage terms, however, the performance of ENS_greedy relative to the best base learner model is better for Ensemble 6 (an average RMSE just 100.07% of the best base learner's) than for Ensemble 16 (100.20%), so in this sense Ensemble 6s can reasonably be said to be better than Ensemble 16s. Ensemble 6s also wind up being the best predictor, like Ensemble 16s, about 15% of the time.

The conclusion I'd draw from this analysis is that the Greedy Ensemble benefits from the exclusion of the poorest performing models. This, of course, presupposes that you know which base learners are more likely to perform poorly - and then we're back into the discussion of omniscience that we had earlier.


My summary of this entire exercise is that:

  • Ensemble methods appear to offer something worthwhile in the prediction of AFL game margins. They won't always deliver the best model but, given a broad enough set of competent base learners, they will reliably deliver a very good model.
  • The Greedy Ensemble approach is clearly superior to the Stacked Ensemble approach.
  • If you've any a priori knowledge of the performance of candidate base learners, ensure that you exclude those that you know are particularly poor.

There's a lot more testing and analysis I'd like to do with ensembles now that I've a feel for how the caretEnsemble package works. Over the course of the season I'll be doing this and writing more about their use and efficacy.