The themes in this blog have been bouncing around in my thoughts - in virtual and in unpublished blog form - for quite a while now. My formal qualifications are as an Econometrician but many of the models that I find myself using in MoS come from the more recent (though still surprisingly old) Machine Learning (ML) discipline, which I'd characterise as being more concerned with the predictive ability of a model than with its theoretical pedigree. (Breiman wrote a wonderful piece on this topic, entitled Statistical Modeling: The Two Cultures, back in 2001.)
What's troubled me, I've begun to realise, about my adoption of ML techniques is their apparent arbitrariness. Just because, say, a random forest (RF) performs better than a Naive Bayes (NB) model in fitting a holdout sample for some AFL metric or outcome, who's to say that the RF technique will continue to outperform the NB approach in the future? It's not as if we can speak of some variable as being "generated by a random forest like process", or identify situations in which an RF model can, a priori, be deemed preferable to an NB model.
Note that I'm not here concerned about the stationarity of the underlying process driving the AFL outcome - every model throws its hands up when the goalposts move - but instead with the relative efficacy of modelling approaches on different samples drawn from approximately the same distribution. Can we, at least empirically, ever legitimately claim that classifier A is quantifiably better at modelling some phenomenon than classifier B?
It took the recent publication of this piece by Delgado et al to crystallise my thoughts on all this and to devise a way of exploring the issue in an AFL context.
THE DATA AND METHODOLOGY
Very much in the spirit of the Delgado paper I'm going to use a variety of different classifiers, each drawing on the same set of predictive variables, and each attempting to model the same outcome: the handicap-adjusted winner of an AFL contest.
I've been slightly less ambitious than the Delgado paper, choosing to employ only 45 classifiers, but those I've selected do cover the majority of types covered in that paper. The classifiers I'm using are from the Discriminant Analysis, Bayesian, Neural Networks, Support Vector Machines, Decision Trees, Rule-Based Methods, Boosting, Bagging, Random Forests, Nearest Neighbour, Partial Least Squares, and General Linear Models families. All I'm missing are representatives from the Stacking, Multivariate Adaptive Regression Splines, and Other Ensemble tribes.
The data I'll be using is from the period 2006 to 2014, and the method I'll be adopting for each classifier is as follows:
- Select the data for a single entire season.
- For those classifiers with tuning parameters, find their optimal values by performing 5 repeats of 4-fold cross-validation using the caret package.
- Assess the performance of the classifier, using the optimal parameter values, when applied to the following season. In other words, if the classifier was tuned on the data from season 2006, measure its performance for season 2007.
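The tune-on-one-season, test-on-the-next scheme described above can be sketched roughly as follows. This is a minimal Python illustration, not the caret code actually used: the function names, the simple shuffled-fold logic and the 200-game season size are all assumptions made for the example.

```python
import random

def season_pairs(first_season, last_season):
    """Return (tuning_season, test_season) pairs: tune on one season, test on the next."""
    return [(s, s + 1) for s in range(first_season, last_season)]

def repeated_kfold(n_rows, k=4, repeats=5, seed=0):
    """Yield (train_idx, holdout_idx) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        # Deal the shuffled rows into k roughly equal folds
        folds = [idx[i::k] for i in range(k)]
        for f in range(k):
            holdout = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, holdout

pairs = season_pairs(2006, 2014)
resamples = list(repeated_kfold(n_rows=200))  # 200 games is a placeholder season size
print(len(pairs), len(resamples))  # 8 season pairs, 20 resamples per tuning run
```

Each candidate set of tuning parameters would be scored on the 20 holdout folds from the tuning season, and the winning set then applied, unchanged, to the following season.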
The regressors available to all the classifiers are:
- Implicit Home Team Probability (here using the Overround Equalising approach)
- Home and Away teams' bookmaker prices
- Home Team bookmaker handicap
- Home and Away teams' MARS Ratings
- Home and Away teams' Venue Experience
- Home team's Interstate Status
- Home and Away teams' recent form, as measured by the change in their MARS Ratings over the two most-recent games from the same season. For the first two games of each season these variables are set to zero.
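To make a couple of these regressors concrete, here is a small Python sketch. It assumes that the Overround Equalising approach amounts to normalising the inverse prices so that the two implied probabilities sum to 1, and it takes one possible reading of the recent form definition (the rating going into a game minus the rating two games earlier). The function names and the example ratings are hypothetical.

```python
def implicit_home_probability(home_price, away_price):
    """Normalise inverse prices so the two implied probabilities sum to 1
    (assumed meaning of the Overround Equalising approach)."""
    inv_home = 1.0 / home_price
    inv_away = 1.0 / away_price
    return inv_home / (inv_home + inv_away)

def recent_form(pre_game_ratings):
    """Change in rating over the two most-recent games of the same season.

    `pre_game_ratings` holds a team's rating going into each game of one
    season, in order. The first two games get a form value of zero.
    """
    form = []
    for i, rating in enumerate(pre_game_ratings):
        if i < 2:
            form.append(0.0)
        else:
            form.append(rating - pre_game_ratings[i - 2])
    return form

# Hypothetical prices and ratings, for illustration only
print(implicit_home_probability(1.50, 3.00))
print(recent_form([1000, 1004, 1001, 1007]))
```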
The main issue I'm seeking to explore is whether any algorithm emerges as consistently better than any other. With the technique and timeframe I've used I obtain 8 measures of each algorithm's performance on which to base such an assessment - not a large sample, but definitely larger than 1.
Secondarily, the methodology also allows me to explore the effects, if any, of tuning the classifiers using different performance metrics and of measuring their performances using different metrics.
In interpreting the results that follow it's important to recognise the inherent difficulty in predicting the winners of line-betting contests, which is what we're asking the various classifiers to do. Were it possible to consistently and significantly predict more than 50% of line-betting winners (or to produce probabilities with associated Brier Scores under 0.25 or Log Probability Scores - as I'll define them later - over 1) then bookmakers everywhere would be justifiably nervous. We probably can't so they're probably not.
With that in mind, it's not entirely surprising that the best-performed classifiers only barely achieve better-than-chance results, no matter what we do.
The first set of charts below summarises the results we obtain when we tune the classifiers using the AUC metric.
Two C5.0-based classifiers finish equal first on the Accuracy metric, averaging just over 53% across the 8 seasons for which predictions were produced. Twenty-nine of the classifiers average better-than-chance, while the worst classifier, treebag, records a significantly worse-than-chance 47.5% performance.
The whiskers at the end of every bar reflect the standard deviation of the classifier's Accuracy scores across the 8 seasons and we can see that the treebag classifier has a relatively large standard deviation. Its best season was 2007 when it was right 55% of the time; in 2012 it was right only 40% of the time.
Because of its binary nature, Accuracy is not a metric I often use in assessing models. I've found it to be highly variable and poor at reliably discriminating amongst models and so have preferred to rely on probability scores such as the Brier Score or Log Probability Score (LPS) in my assessments. Charts for these two metrics appear below.
Note that, for the purposes of this analysis I've defined the LPS as 2+log(P) where P is the probability attached to the final outcome and the log is measured in base 2. Adding 2 makes the chance LPS equal to +1 (ie the LPS a classifier would achieve by assigning a probability of 50% to the home team in every contest, since the base 2 log of 0.5 is -1). The equivalent chance Brier Score is 0.25. Recall that higher LPSs and lower Brier Scores are preferred.
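As a quick check on those chance benchmarks, here is a minimal Python sketch of the two probability scores. The LPS follows the definition given above; the Brier Score is written in its usual binary form (the squared difference between the home-team probability and the 0/1 result), which I'm assuming is the form intended here. The function names are mine.

```python
import math

def log_probability_score(p_outcome):
    """LPS as defined above: 2 + log2 of the probability given to the actual outcome."""
    return 2.0 + math.log2(p_outcome)

def brier_score(p_home, home_won):
    """Squared error between the home-team probability and the 0/1 result."""
    return (p_home - (1.0 if home_won else 0.0)) ** 2

# A classifier that always says 50% scores exactly the chance benchmarks
print(log_probability_score(0.5), brier_score(0.5, True))  # 1.0 0.25
```

Season-level figures would then be averages of these per-game scores across all the games in the season.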
The ranking of the classifiers on these two metrics is very similar, excepting that the C5.0Rules algorithm does well under an LPS regime but not under a Brier Score regime. The blackboost, ctree, gbm and cforest algorithms, along with three SVM-based and four Partial Least Squares-based algorithms, fill the top places on both measures. Only blackboost and ctree record better-than-chance values for both metrics. They also predict with significantly better than 50% Accuracy.
(Note that three classifiers - rpart, rpart2 and treebag - fail to register an LPS because they occasionally generate probabilities of exactly 0 or 1, and the LPS is undefined whenever a probability of 0 is attached to the outcome that actually occurs.)
In a future blog I might look at bootstrapping the probability scores of some or all of the classifiers with a view to assessing the statistical significance of the differences in their performances, but for now I'll settle for a tentative conclusion that the 10 or so best classifiers under the Brier Score metric are, in some sense, generally superior to the 10 or so worst classifiers under that same metric.
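For what it's worth, one way such a bootstrap might look is sketched below in Python. It is a paired bootstrap on per-game score differences between two classifiers, returning a 95% percentile interval for the mean difference. The function name, the number of resamples and the choice of a percentile interval are all assumptions for illustration, not a description of any analysis actually performed.

```python
import random

def bootstrap_mean_difference(scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap: resample games with replacement and return a 95%
    percentile interval for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    # Pair the two classifiers' scores game by game
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return lower, upper
```

If the inputs were per-game Brier Scores and the resulting interval lay entirely below zero, that would be some evidence that classifier A's Brier Score advantage over classifier B is more than sampling noise.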
Now in this first analysis I've tuned the classifiers on one metric (AUC) and then measured their performance on a holdout dataset and on different metrics (Accuracy, Brier Score and LPS). What if, instead, I tuned the classifiers using the Brier Score metric?
Perhaps unsurprisingly it doesn't make much difference when we assess the subsequent models using the Accuracy metric.
We find the same two C5.0-based classifiers at the head of the table, gbm and nnet doing a little better than previously, a few classifiers doing a little worse, but not a lot changing. In fact, 17 classifiers do worse in terms of Accuracy when tuned using a Brier Score instead of an AUC metric, 18 do better, and 10 stay the same.
Accuracy though, as I've already noted, is not my favourite metric, nor was it the metric on which we based our tuning in this round of analysis. That metric was instead a probability score - specifically, the Brier Score - so let's see how the models perform when we assess them using probability score metrics.
The blackboost and ctree classifiers continue to lead, and six of the 10 best classifiers under the AUC-based tuning regime remain in the top 10 under a BS-based tuning regime. The two neural network classifiers, nnet and avNNet, benefit greatly from the change in tuning regime, while the four Partial Least Squares classifiers suffer a little.
Overall though the change of tuning regime is a net positive for classifier Brier Scores. Twenty-seven classifiers finish with a better BS, while just four finish with an inferior BS, and five classifiers now show better-than-chance Scores.
The story is similar for classifier Log Probability Scores, with 23 recording improvements, six showing declines, and five finishing with better-than-chance Scores.
Those five classifiers are the same in both cases and, going by the conclusions that follow, are blackboost, ctree, cforest, gbm and nnet.
We started with a goal of determining whether or not it's fair to claim that some classifiers are, simply, better than others, at least in terms of the limited scope to which I've applied them here (viz the prediction of AFL line-betting results). I think we can claim that this appears to be the case, given that particular classifiers emerged as better-than-chance performers across 8 seasons and over 1,600 games even when we:
- Changed the basis on which we tuned them
- Changed the metric on which we assessed them
Specifically, it seems fair to declare blackboost, ctree, cforest and gbm as being amongst the best general classifiers, and the nnet classifier as capable of joining this group provided that an appropriate tuning metric is employed.
Classifiers based on Support Vector Machines and Partial Least Squares methods also showed some potential in the analyses performed here, though none achieved better-than-chance probability scores under either tuning regime. These classifiers, I'd suggest, require further work to fully assess their efficacy.
One other phenomenon that the analyses revealed - reinforced really - was the importance of choosing appropriately aligned tuning and performance metrics. Two of the four best classifiers, cforest and gbm, only achieved better-than-chance Brier and Log Probability Scores when we switched from tuning them on AUC to tuning them on BS. On such apparently small decisions are profits won and lost.
I'll finish with this sobering abstract from David J Hand's 2006 paper:
"A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm."
... which reminds me of this:
“So we beat on, boats against the current, borne back ceaselessly into the past.”