On Choosing Strong Classifiers for Predicting Line Betting Results

The themes in this blog have been bouncing around in my thoughts - in virtual and in unpublished blog form - for quite a while now. My formal qualifications are as an Econometrician, but many of the models that I find myself using in MoS come from the more recent (though still surprisingly old) Machine Learning (ML) discipline, which I'd characterise as being more concerned with the predictive ability of a model than with its theoretical pedigree. (Breiman wrote a wonderful piece on this topic, entitled Statistical Modeling: The Two Cultures, back in 2001.)

What's troubled me, I've begun to realise, about my adoption of ML techniques is their apparent arbitrariness. Just because, say, a random forest (RF) performs better than a Naive Bayes (NB) model in fitting a holdout sample for some AFL metric or outcome, who's to say that the RF technique will continue to outperform the NB approach in the future? It's not as if we can speak of some variable as being "generated by a random forest-like process", or identify situations in which an RF model can, a priori, be deemed preferable to an NB model.

Note that I'm not here concerned about the stationarity of the underlying process driving the AFL outcome - every model throws its hands up when the goalposts move - but instead with the relative efficacy of modelling approaches on different samples drawn from approximately the same distribution. Can we, at least empirically, ever legitimately claim that classifier A is quantifiably better at modelling some phenomenon than classifier B?

It took the recent publication of this piece by Delgado et al. to crystallise my thoughts on all this and to devise a way of exploring the issue in an AFL context.

THE DATA AND METHODOLOGY

Very much in the spirit of the Delgado paper I'm going to use a variety of different classifiers, each drawing on the same set of predictive variables, and each attempting to model the same outcome: the handicap-adjusted winner of an AFL contest.

I've been slightly less ambitious than the Delgado paper, choosing to employ only 45 classifiers, but those I've selected do cover the majority of types covered in that paper. The classifiers I'm using are from the Discriminant Analysis, Bayesian, Neural Networks, Support Vector Machines, Decision Trees, Rule-Based Methods, Boosting, Bagging, Random Forests, Nearest Neighbour, Partial Least Squares, and General Linear Models families. All I'm missing are representatives from the Stacking, Multivariate Adaptive Regression Splines, and Other Ensemble tribes.

The data I'll be using is from the period 2006 to 2014, and the method I'll be adopting for each classifier is as follows:

  1. Select the data for a single entire season.
  2. For those classifiers with tuning parameters, find optimal values of them by performing 5 repeats of 4-fold cross-validation using the caret package.
  3. Assess the performance of the classifier, using the optimal parameter values, when applied to the following season. In other words, if the classifier was tuned on the data from season 2006, measure its performance for season 2007.

In step 2 we'll optimise for two different metrics, Area Under the Curve and Brier Score, and in step 3 we'll measure performance in terms of Accuracy, Brier Score and Log Probability Score.
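The three steps above can be sketched in code. The actual analysis was run with R's caret package; the following is an illustrative scikit-learn translation, with a gbm-style classifier standing in for any one of the 45 and a deliberately tiny parameter grid. All function and column choices here are my own, not the original code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

def tune_then_test(X_train, y_train, X_next, y_next, metric="roc_auc"):
    """Tune on one season via 5 repeats of 4-fold CV; score on the next.

    y is 1 if the home team covered the line, 0 otherwise.
    """
    cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=1)
    grid = GridSearchCV(
        GradientBoostingClassifier(random_state=1),  # stand-in for one classifier
        param_grid={"n_estimators": [50, 100], "max_depth": [1, 2]},
        scoring=metric,  # "roc_auc" here; "neg_brier_score" to tune on Brier
        cv=cv,
    )
    grid.fit(X_train, y_train)
    p = grid.predict_proba(X_next)[:, 1]
    p = np.clip(p, 1e-6, 1 - 1e-6)  # guard against exact 0/1 probabilities
    accuracy = np.mean((p > 0.5) == y_next)
    brier = np.mean((p - y_next) ** 2)
    lps = np.mean(2 + np.log2(np.where(y_next == 1, p, 1 - p)))
    return accuracy, brier, lps
```

The key design point is that tuning only ever sees one season's data, and performance is always measured out-of-sample on the following season.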

The regressors available to all the classifiers are:

  • Implicit Home Team Probability (here using the Overround Equalising approach)
  • Home and Away teams' bookmaker prices
  • Home Team bookmaker handicap
  • Home and Away teams' MARS Ratings
  • Home and Away teams' Venue Experience
  • Home team's Interstate Status
  • Home and Away teams' recent form, as measured by the change in their MARS Ratings over the two most-recent games from the same season. For the first two games of each season these variables are set to zero.
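The recent-form regressor in the last bullet is the only one that needs construction rather than lookup. A minimal pandas sketch of how it might be built, under the assumption that each row is one team-game with hypothetical columns `season`, `team`, `round` and `mars` (the team's MARS Rating entering that game):

```python
import pandas as pd

def recent_form(df):
    """Recent form: change in MARS Rating over the two most-recent
    games of the same season; zero for each team's first two games."""
    df = df.sort_values(["season", "team", "round"]).copy()
    # current rating minus the rating two games earlier, within season
    df["form"] = df.groupby(["season", "team"])["mars"].diff(2)
    # the first two games of a season have no two-game history
    df["form"] = df["form"].fillna(0.0)
    return df
```

Grouping by season as well as team ensures the window never reaches back into the previous season, which is what forces the zeroes in rounds one and two.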

The main issue I'm seeking to explore is whether any algorithm emerges as consistently better than any other. With the technique and timeframe I've used I obtain 8 measures of each algorithm's performance on which to base such an assessment - not a large sample, but definitely larger than 1.

Secondarily, the methodology also allows me to explore the effects, if any, of tuning the classifiers using different performance metrics and of measuring their performances using different metrics.

THE RESULTS

In interpreting the results that follow it's important to recognise the inherent difficulty in predicting the winners of line-betting contests, which is what we're asking the various classifiers to do. Were it possible to consistently and significantly predict more than 50% of line-betting winners (or to produce probabilities with associated Brier Scores under 0.25 or Log Probability Scores - as I'll define them later - over 1) then bookmakers everywhere would be justifiably nervous. We probably can't, so they're probably not.

With that in mind, it's not entirely surprising that the best-performed classifiers only barely achieve better-than-chance results, no matter what we do.

The first set of charts below summarises the results we obtain when we tune the classifiers using the AUC metric.

Two C5.0-based classifiers finish equal first on the Accuracy metric, averaging just over 53% across the 8 seasons for which predictions were produced. Twenty-nine of the classifiers average better than chance, while the worst classifier, treebag, records a significantly worse-than-chance 47.5% performance.

The whiskers at the end of each bar reflect the standard deviation of the classifier's Accuracy scores across the 8 seasons, and we can see that the treebag classifier has a relatively large standard deviation. Its best season was 2007, when it was right 55% of the time; in 2012 it was right only 40% of the time.

Because of its binary nature, Accuracy is not a metric I often use in assessing models. I've found it to be highly variable and poor at reliably discriminating amongst models, and so have preferred to rely on probability scores such as the Brier Score or Log Probability Score (LPS) in my assessments. Charts for these two metrics appear below.

Note that, for the purposes of this analysis, I've defined the LPS as 2 + log(P), where P is the probability attached to the final outcome and the log is measured in base 2. Adding 2 makes the chance LPS equal to +1 (ie the LPS a classifier would achieve by assigning a probability of 50% to the home team in every contest). The equivalent chance Brier Score is 0.25. Recall that higher LPSs and lower Brier Scores are preferred.
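These two metrics are simple enough to compute directly. A short sketch, using the definitions above:

```python
import math

def lps(p_outcome):
    """LPS as defined here: 2 + log2(P), where P is the probability
    the classifier assigned to the eventual outcome."""
    return 2 + math.log2(p_outcome)

def brier(p_home, home_covered):
    """Per-game Brier Score: squared error of the home-team
    probability against the 0/1 outcome (lower is better)."""
    outcome = 1.0 if home_covered else 0.0
    return (p_home - outcome) ** 2
```

A 50% probability in every game gives the chance benchmarks quoted above: `lps(0.5)` returns 1.0 and `brier(0.5, True)` returns 0.25, while a well-calibrated 75% probability on a home win scores better on both.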

The ranking of the classifiers on these two metrics is very similar, except that the C5.0Rules algorithm does well under an LPS regime but not under a Brier Score regime. The blackboost, ctree, gbm and cforest algorithms, along with three SVM-based and four Partial Least Squares-based algorithms, fill the top places on both measures. Only blackboost and ctree record better-than-chance values for both metrics; they also predict with significantly better than 50% Accuracy.

(Note that three classifiers - rpart, rpart2 and treebag - fail to register an LPS because they occasionally generate probabilities of exactly 0 or 1, which can lead to an undefined LPS.)

In a future blog I might look at bootstrapping the probability scores of some or all of the classifiers with a view to assessing the statistical significance of the differences in their performances, but for now I'll settle for a tentative conclusion that the 10 or so best classifiers under the Brier Score metric are, in some sense, generally superior to the 10 or so worst classifiers under that same metric.
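As a preview of what such a bootstrap might look like, here is a minimal sketch. It is not the analysis promised above, just one plausible shape for it: resample the per-game squared-error differences between two classifiers and see how often one comes out ahead. The function name and inputs are my own invention.

```python
import numpy as np

def bootstrap_brier_diff(sq_err_a, sq_err_b, n_boot=10000, seed=1):
    """Bootstrap the per-game squared-error differences between two
    classifiers; return the fraction of resamples in which A records
    the lower (better) mean Brier Score."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(sq_err_a) - np.asarray(sq_err_b)
    n = len(diff)
    # resample game indices with replacement, n_boot times
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diff[idx].mean(axis=1)
    return np.mean(boot_means < 0)
```

A returned fraction near 1 would suggest classifier A's Brier superiority is robust to resampling; a fraction near 0.5 would suggest the observed gap is mostly noise.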

Now in this first analysis I've tuned the classifiers on one metric (AUC) and then measured their performance on a holdout dataset and on different metrics (Accuracy, Brier Score and LPS). What if, instead, I tuned the classifiers using the Brier Score metric?

Perhaps unsurprisingly it doesn't make much difference when we assess the subsequent models using the Accuracy metric. 

We find the same two C5.0-based classifiers at the head of the table, gbm and nnet doing a little better than previously, a few classifiers doing a little worse, but not a lot changing. In fact, 17 classifiers do worse in terms of Accuracy when tuned using a Brier Score instead of an AUC metric, 18 do better, and 10 stay the same.

Accuracy though, as I've already noted, is not my favourite metric, nor was it the metric on which we based our tuning in this round of analysis. That metric was instead a probability score - specifically, the Brier Score - so let's see how the models perform when we assess them using probability score metrics.

The blackboost and ctree classifiers continue to lead, and six of the 10 best classifiers under the AUC-based tuning regime remain in the top 10 under a BS-based tuning regime. The two neural network classifiers, nnet and avNNet, benefit greatly from the change in tuning regime, while the four Partial Least Squares classifiers suffer a little.

Overall, though, the change of tuning regime is a net positive for classifier Brier Scores: twenty-seven classifiers finish with a better BS, just four finish with an inferior one, and five classifiers now show better-than-chance Scores.

The story is similar for classifier Log Probability Scores, with 23 recording improvements, six showing declines, and five finishing with better-than-chance Scores.

Those five classifiers are the same in both cases and are:

  • blackboost
  • ctree
  • nnet
  • cforest
  • gbm

CONCLUSION

We started with a goal of determining whether or not it's fair to claim that some classifiers are, simply, better than others, at least in terms of the limited scope to which I've applied them here (viz the prediction of AFL line-betting results). I think we can claim that this appears to be the case, given that particular classifiers emerged as better-than-chance performers across 8 seasons and over 1,600 games even when we:

  1. Changed the basis on which we tuned them
  2. Changed the metric on which we assessed them

Specifically, it seems fair to declare blackboost, ctree, cforest and gbm as being amongst the best, general classifiers, and the nnet classifier as capable of joining this group provided that an appropriate tuning metric is employed.

Classifiers based on Support Vector Machines and Partial Least Squares methods also showed some potential in the analyses performed here, though none achieved better-than-chance probability scores under either tuning regime. These classifiers, I'd suggest, require further work to fully assess their efficacy.

One other phenomenon that the analyses revealed - reinforced really - was the importance of choosing appropriately aligned tuning and performance metrics. Two of the four best classifiers, cforest and gbm, only achieved better-than-chance Brier and Log Probability Scores when we switched from tuning them on AUC to tuning them on BS. On such apparently small decisions are profits won and lost.

I'll finish with this sobering abstract from David J Hand's 2006 paper, Classifier Technology and the Illusion of Progress:

"A great many tools have been developed for supervised classification,
ranging from early methods such as linear discriminant analysis
through to modern developments such as neural networks and support
vector machines. A large number of comparative studies have been
conducted in attempts to establish the relative superiority of these
methods. This paper argues that these comparisons often fail to take
into account important aspects of real problems, so that the apparent
superiority of more sophisticated methods may be something of an illusion.
In particular, simple methods typically yield performance almost
as good as more sophisticated methods, to the extent that the difference
in performance may be swamped by other sources of uncertainty that
generally are not considered in the classical supervised classification
paradigm."

... which reminds me of this:

“So we beat on, boats against the current, borne back ceaselessly into the past.”
