Clustering Men's AFL Games Based on the Margin Trajectory

It’s been a while since I’ve felt I’ve had much time to do more than post the outputs of any football analyses to Twitter, but sheltering-in-place and something of a lull in my consulting work has left me with a bit of time to redress that.

(Need some data analysed? I’m available. Rates negotiable.)

So, first up, I want to talk about a technique I’ve often thought about but never realised had been codified and - best of all - turned into an R package. That technique is time series clustering, and the R package that helps us do it is called TSrepr.

The TSrepr package allows you to perform dimension reduction on a set of time series in a vast number of different ways, producing a simplified representation of your original time series that is suitable for clustering.

In today’s blog we’re going to apply the technique to games from the men’s AFL competition.

DATA

For today’s blog we’ll use the score progression data for all home and away games played between 2010 and 2019, and we’ll consider the time series value at any point as the lead that the designated home team had at that point. We’ll create time series of equal length from that score progression data by “sampling” the margin for every game at 5% intervals, giving us 20 points per game.
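As a concrete (if toy) illustration of that sampling step, the sketch below builds a 20-point series from a single game's score progression. The names sample_margin, score_times and margins are mine, not from the actual data set: score_times holds the fraction of the game elapsed at each scoring event (with 0 added for the opening bounce), and margins the home team's lead after each of those events.

```r
sample_margin <- function(score_times, margins, n_points = 20) {
  # The margin is a step function of time: it holds its value between
  # scoring events, so we interpolate with method = "constant".
  # rule = 2 carries the final margin through to full time
  approx(x = score_times, y = margins,
         xout = (1:n_points) / n_points,
         method = "constant", rule = 2)$y
}

# A toy game: the home team goals early and concedes late
sample_margin(c(0, 0.10, 0.80), c(0, 6, -1))
```

Applied to every game, this yields the rectangular games-by-timepoints structure that the clustering below requires.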

METHODOLOGY

As foreshadowed, we’ll first use the repr_matrix() function from the TSrepr package. Specifically, we’ll call

repr_matrix(<time series dataframe>, func = repr_paa, args = list(q = 5, func = meanC), normalise = FALSE, func_norm = norm_z, windowing = FALSE, win_size = NULL)

This reduces our 1,946 x 20 dataframe down to a 1,946 x 4 matrix representation of it. Hard as it is to believe, it turns out that four values can describe the trajectory of a football game fairly well, as we'll see.
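For intuition about where those four values come from, here is a base-R sketch of what that repr_paa call does to a single 20-point series: Piecewise Aggregate Approximation (PAA) with q = 5 simply replaces each block of five consecutive points with its mean (meanC being TSrepr's fast C implementation of the mean). The margins vector below is invented for illustration.

```r
# Base-R equivalent of PAA with block length q and the mean as the
# aggregating function
paa <- function(x, q = 5) {
  as.numeric(tapply(x, rep(seq_len(length(x) / q), each = q), mean))
}

# An invented 20-point margin trajectory for a home team that leads
# throughout...
margins <- c(0, 2, 2, 8, 8, 8, 14, 14, 13, 13,
             13, 19, 19, 18, 18, 24, 24, 24, 30, 29)

# ...collapses to four block means
paa(margins)
# → 4.0 12.4 17.4 26.2
```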

We’ll pass that matrix representation through the pam() clustering algorithm from the cluster package in R, and calculate the Davies-Bouldin metric (via the intCriteria() function from the clusterCrit package) for cluster solutions involving from 2 to 30 clusters. There’s nothing magical about that range, but in my experience there’s a practical upper limit on the number of clusters a useful solution can have, and I’ve somewhat arbitrarily set that limit at 30.

The code for this (where fd_modern_TS_Rep is the 1,946 x 4 matrix) is

library(cluster)      # pam()
library(clusterCrit)  # intCriteria()
library(ggplot2)

# Fit a PAM solution for each candidate number of clusters
clusterings <- lapply(2:30, function(x) pam(fd_modern_TS_Rep, x))

# Calculate the Davies-Bouldin index for each solution
DB_values <- sapply(seq_along(clusterings), function(x)
  intCriteria(fd_modern_TS_Rep, as.integer(clusterings[[x]]$clustering), c("Davies_Bouldin")))

# Plot the index against the number of clusters
ggplot(data.frame(Clusters = 2:30, DBindex = unlist(DB_values)), aes(Clusters, DBindex)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  theme_bw() +
  theme(axis.title.x = element_text(face = "bold", size = 20),
        axis.text.x = element_text(size = 15, face = "bold"),
        axis.text.y = element_text(size = 15, face = "bold"),
        axis.title.y = element_text(face = "bold", size = 20))

We get the following (fairly unattractive but highly functional) chart, which reveals an optimum (the minimum of the Davies-Bouldin index, for which lower is better) when we have 27 clusters.

If we chart the time series for each of the 1,946 games based on the cluster to which that series belongs, we obtain the much more attractive and more interesting chart below. Note that the red line tracks the median margin for games within the cluster, and the blue line tracks the time series for the game that was closest to that median (in terms of average absolute deviation from the median across the whole game).
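For concreteness, here's a sketch of how those two lines might be computed for a single cluster. The names are mine: cluster_games stands for a hypothetical games-by-timepoints matrix of sampled margins for the games assigned to one cluster.

```r
median_trajectory <- function(cluster_games) {
  # The red line: the pointwise median margin across the cluster's games
  apply(cluster_games, 2, median)
}

archetypal_game <- function(cluster_games) {
  med <- median_trajectory(cluster_games)
  # The blue line: the game whose trajectory sits closest to the median,
  # measured by mean absolute deviation across the whole game
  deviations <- apply(cluster_games, 1, function(g) mean(abs(g - med)))
  cluster_games[which.min(deviations), ]
}
```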

We see that most cluster types contain games with remarkably similar home team margin trajectories. In Cluster 1, for example, the home team usually trails right from the start but mostly sees its deficit remain fairly constant throughout the second half. Contrast that with Cluster 3 where the home team’s deficit builds virtually linearly throughout the entire game. Cluster 27 has a similar overall shape, but the size of the eventual defeat tends to be somewhat larger.

Cluster 21 includes the largest number of games (121), but clusters 5 (105), 2 (104), 4 (102), and 15 (101) also contain more than 100 games. Only five clusters contain fewer than 50 games: Cluster 22 (15), Cluster 27 (36), Cluster 11 (38), Cluster 18 (45), and Cluster 14 (46). Every other cluster contains between 50 and 100 games.

THE ARCHETYPAL GAMES

In the previous chart, 27 games are identified as being the most archetypal, one for each cluster. The score progression for those games is shown in blue, and those games are:

The value shown in brackets is the final home team margin, and each game is linked to its matching AFLTables entry.

SEASON PROFILES

If you prefer to watch games where the result is in doubt for the longest, you’d probably opt for games with the smallest absolute average margin across the game. As the chart below reveals, you tend to find these in games from clusters 2, 4, 9, 10, 12, 13, 21 and 24.
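One reading of that closeness measure is sketched below, with invented names: all_games as the full 1,946 x 20 matrix of sampled margins, and cluster_id as the vector of cluster assignments from pam().

```r
# For each game, take the absolute value of its average margin across the
# 20 sampled points, then average that measure within each cluster; the
# "close" clusters are those at the head of the sorted result
closeness_by_cluster <- function(all_games, cluster_id) {
  abs_avg_margin <- abs(rowMeans(all_games))
  sort(tapply(abs_avg_margin, cluster_id, mean))
}
```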

In a typical year, somewhere between about 30% and 45% of home-and-away games will be from one of those clusters. In Season 2019, 39% of games came from those clusters, which followed on from a season-high 44% in 2018. The low of 31% came in both 2012 and 2013.

The counts by cluster type for each season appear in the chart below.

TEAM PROFILE

To finish, let’s call the eight clusters we identified as having the lowest absolute mean margins the “close” clusters, and see what proportion of each team’s home and away games across the decade has come from these clusters.

For most teams, we see that between 30% and 40% of home and of away games have come from the close clusters. The notable exceptions are:

  • Brisbane Lions and Gold Coast, who’ve had relatively few games from the close clusters when playing away

  • Collingwood, Geelong, and Hawthorn, who’ve had a relatively large number of games from the close clusters when playing away
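A sketch of how those proportions could be computed, again with hypothetical names: game_info is a data frame with one row per game and columns Home.Team, Away.Team and Cluster, and close_clusters is the vector of cluster numbers deemed close.

```r
# Share of each team's home games, and of its away games, that landed in
# one of the close clusters
close_share <- function(game_info, close_clusters) {
  is_close <- game_info$Cluster %in% close_clusters
  data.frame(
    Home = tapply(is_close, game_info$Home.Team, mean),
    Away = tapply(is_close, game_info$Away.Team, mean))
}
```

This assumes every team appears in both the home and away columns, so the two tapply() results align on the same alphabetically ordered team names.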

SUMMARY

In short, what we’ve found in the analysis for this blog is that there are 27 basic plots that describe all 1,946 home-and-away games from the seasons 2010 to 2019. Some games adhere closely to a particular plot, and some stray a little from their assigned plot, but there is nonetheless a broad score progression trajectory that defines them.

I think that’s just interesting in its own right, but would be keen to hear if you think it could form the basis of some further analysis.