Where Can I Find an Example of the Use of X?
As mentioned elsewhere, MoS exists partly as a vehicle through which I can learn about the real-world application of a range of statistical modelling and data analytic techniques. In the lists that follow I've attempted to catalogue them all, but if you find one that I've missed, please let me know and I'll add it.
(2016 Update: I've not amended or expanded the information that follows in quite a while. If you can't find what you're after, try searching using the options under the Search The Site section in the navigation bar on the right.)
Statistical Modelling / Machine Learning Techniques
The machine learning community distinguishes between mathematical algorithms for regression and those for classification, a dichotomy I'll also use here.
Regression algorithms allow us to map from a feature space - that is, a set of regressors - to a real (possibly bounded) value - a probability, a team score or a total score, for example.
- Binary logits
Historically, I've used binary logits a lot on MoS. They're the go-to algorithm when the target variable is binary, so they often arise when I'm fitting a model to the game result, win or loss, from the point of view of the Home team or of the favourite. A binary logit was also used in this blog, where the target variable was whether Kelly staking or Level staking was superior given certain characteristics of a notional bookmaker and a notional punter.
The technique made another appearance when first exploring the efficacy of Venue Experience in predicting game outcome, and also when modelling the Home team's result using in-running data and a model inspired by Brownian motion.
Doubtless there are a lot more uses that I've missed here.
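As a rough illustration of what fitting a binary logit involves (a minimal sketch, not the code behind any of the posts mentioned), the Python below fits a one-regressor model of the Home team's win/loss result by gradient ascent on the log-likelihood. The data are simulated, and the regressor, a made-up "rating edge" for the Home team, along with the true coefficients 0.3 and 1.2, are assumptions chosen purely for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_binary_logit(xs, ys, learning_rate=0.1, n_iter=2000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus fitted probability
            g0 += err
            g1 += err * x
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


# Simulated (hypothetical) data: x is an invented rating edge for the Home
# team, y is 1 for a Home win and 0 for a Home loss.
rng = random.Random(42)
xs = [rng.uniform(-3, 3) for _ in range(400)]
ys = [1 if rng.random() < sigmoid(0.3 + 1.2 * x) else 0 for x in xs]

b0, b1 = fit_binary_logit(xs, ys)
p_home_win = sigmoid(b0 + b1 * 1.0)  # fitted Home win probability at x = 1
print(f"intercept={b0:.2f}, slope={b1:.2f}, P(Home win | x=1)={p_home_win:.2f}")
```

In practice a statistics package would do the fitting (and report standard errors); the hand-rolled loop is only there to show that the model output is a probability obtained by passing a linear function of the regressors through the logistic function.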
- Conditional Inference Tree and Random Forests
Tree-based regression algorithms have also, historically, done a lot of the heavy lifting in MoS, though they do much of it behind the scenes (refer to the FAQ entry about the statistical modelling techniques used for prediction). Most commonly it's the variety known as the Conditional Inference Tree Forest that I'm using for the purposes of regression, but in blog posts it seems that I only ever write about them in the context of algorithmic competitions (such as this one, this one, this one and this one), where I'm testing a variety of algorithms on the same data.
I also use Random Forests, which were the forerunners of Conditional Inference Tree Forests, though I use them less and blog about them less still. The only references to random forests that I could find were this one in a blog about predicting the aggregate score of a game, and this one about whether it was easier to predict the Home or the Away team score.
Classification algorithms allow us to map from a feature space to a label - a type of grand final or a type of home-and-away season game, for example.
- Partitioning Around Medoids
I've made surprisingly little use of classification techniques in MoS. I've used them only twice, on both occasions calling on the services of Partitioning Around Medoids (PAM): once to come up with a Grand Final typology, and once to come up with a typology for games from the home-and-away season.
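For readers unfamiliar with PAM, here is a minimal Python sketch of the idea, assuming Euclidean distance and a tiny invented set of (Home score, Away score) pairs. It is not the analysis from either post: the data, the choice of k = 3 and the function name pam are all hypothetical. The algorithm picks k actual observations as "medoids" (a greedy BUILD step), then repeatedly makes the best medoid-for-non-medoid swap until no swap reduces the total distance from each point to its nearest medoid (the SWAP step).

```python
import math


def pam(points, k, dist=math.dist):
    """Partition `points` into k clusters around medoids (BUILD then SWAP)."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    def total_cost(medoids):
        # Sum of each point's distance to its nearest medoid
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    # BUILD: greedily add the point that most reduces the total cost
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(medoids + [c])))

    # SWAP: apply the single best swap until no swap lowers the cost
    while True:
        best_cost, best_swap = total_cost(medoids), None
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                cost = total_cost(trial)
                if cost < best_cost:
                    best_cost, best_swap = cost, trial
        if best_swap is None:
            break
        medoids = best_swap

    # Label each point with the position of its nearest medoid
    labels = [min(range(k), key=lambda j: d[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Hypothetical (Home score, Away score) pairs: three Home blowouts,
# three close games and three Away blowouts.
games = [(110, 60), (105, 65), (115, 58),
         (80, 78), (85, 84), (79, 82),
         (60, 100), (66, 108), (58, 104)]

medoids, labels = pam(games, k=3)
for j, m in enumerate(medoids):
    members = [g for g, lab in zip(games, labels) if lab == j]
    print(f"type {j}: medoid {games[m]}, members {members}")
```

Because every medoid is a real observation, each resulting "type" can be described by pointing to an actual game that typifies it, which is part of what makes the technique convenient for building typologies.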
Data Analysis Techniques
Most of the data analysis techniques I deploy in MoS are so routine they don't deserve chronicling here.
Measuring Non-Linear Relationships
- Maximal Information Coefficient (MIC)
Using the MINER package in R and its implementation of the Maximal Information Coefficient (MIC) qualifies, I think, as an example of a less-mundane analysis. In this blog I used it to identify non-linear relationships between some of the variables used in MoS.
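To see why a measure designed for non-linear relationships can be useful, the short Python sketch below shows the ordinary Pearson correlation reporting roughly zero for a perfectly deterministic quadratic relationship. It does not compute MIC itself, and the data are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented data: x runs symmetrically from -2 to 2
xs = [i / 10 for i in range(-20, 21)]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(f"linear: r = {r_linear:.3f}; quadratic: r = {r_quadratic:.3f}")
```

In the quadratic case y is completely determined by x, yet the linear correlation is essentially zero, which is the kind of relationship that a measure of general (not just linear) association is meant to pick up.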
Visualising Relationships or Trends
- 3D Density and Contour Plots
Some data is best visualised in 3 dimensions, and in this blog I used the rgl package to plot the relationship between Home team and Away team scores as a 3D Density plot and as a contour plot.
- 2D Density Plots with Standard Errors
The R package ggplot2 allows the determined user to produce sophisticated and visually appealing charts. In this blog I used ggplot2 to plot a density function relating the outcome of various portions of a game from the Home team's viewpoint to the bookmaker's implicit probability for that same game.
- Multivariate Plots
ggplot2 was again deployed in this blog, which looked at victory margins in Grand Finals and the home-and-away performances of each team in the 2010 season.
One of the terms used in the Grammar of Graphics is faceting, which relates to the act of creating the same graph on subsets of some data with those subsets defined by the faceting variable. I used faceting in this blog to investigate the changing nature of Home team probability from season to season, with season as the faceting variable.
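The mechanics of faceting can be sketched without any plotting library at all. The Python below is a minimal illustration, not the ggplot2 code from the post: it splits some invented (season, Home team probability) records by season and draws the same crude text histogram for each subset. The records, the bin edges and the helper names facet and draw_panel are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical (season, Home team probability) records
records = [
    (2008, 0.62), (2008, 0.55), (2008, 0.71),
    (2009, 0.48), (2009, 0.66), (2009, 0.35), (2009, 0.58),
    (2010, 0.80), (2010, 0.52), (2010, 0.44),
]


def facet(rows, key, value):
    """Split rows into one list of values per level of the faceting variable."""
    panels = defaultdict(list)
    for row in rows:
        panels[key(row)].append(value(row))
    return dict(sorted(panels.items()))


def draw_panel(title, probs, bins=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Render the same text histogram for any subset of the data.
    Bins are half-open [lo, hi), so a value of exactly 1.0 would be
    dropped in this sketch."""
    lines = [f"Season {title}"]
    for lo, hi in zip(bins, bins[1:]):
        count = sum(1 for p in probs if lo <= p < hi)
        lines.append(f"  {lo:.2f}-{hi:.2f} {'#' * count}")
    return "\n".join(lines)


# Season is the faceting variable: one panel per season, same chart in each
panels = facet(records, key=lambda r: r[0], value=lambda r: r[1])
for season, probs in panels.items():
    print(draw_panel(season, probs))
```

The point is that the chart definition is written once and applied to every subset, so differences between panels reflect differences in the data rather than in how each chart was built.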