About 18 months ago I investigated the statistical properties of home teams' and away teams' scoring behaviour over the period from the start of the 2006 season to the middle of the 2012 season taken as a whole. In that blog, using the VGAM package, I found that the Normal distribution provided a reasonable fit to the scores of Home teams and a much better fit to the scores of Away teams over that entire period.
I went on in that blog to estimate the correlation between Home team and Away team scores and to fit a bivariate Normal to the data.
In this blog I'll be revisiting some similar themes but expanding the time horizon for the analysis, reaching all the way back to the start of the competition in 1897. Rather than considering this expanse of history in its entirety, I'll be modelling each season separately, a strategy that's virtually a necessity given the lack of homogeneity in scoring across that period, as we'll see.
Included in the analysis are the results of every game in every season, including the Finals. Home team status has been based on the AFL's designation for all home-and-away season games and based on my own methodology for all Finals, in which I attach Home team status to the team with the higher All-Time MARS Rating at the time.
Below are 2D density plots of Home team (x-axis) and Away team (y-axis) scores for every season. In each plot, lighter blues represent higher result density - in other words, a larger number of games with scores in that vicinity.
From the plot you can see the gradual increase in Home team and in Away team scores over the early years of the competition, reflected in the gradual north-easterly movement of the light blue patch from one season to the next. You can also judge the level of correlation in the scores by estimating the slope and concentration of the cloud of data points, especially the light blue portion.
The scores of Home and Away teams are more highly correlated (i.e. the correlation is closer to 1 in absolute value) when the slope - positive or negative - is steepest and the cloud of points is relatively compressed. If you look at the panels for the last three seasons you'll see good examples of relatively high correlation between Home team and Away team scores: numerically, the correlations are all about -0.4. By comparison, 1987 is a good example of a season where the correlation is near zero (it was -0.04) - it's more blob than submarine.
Finally, the sign of the slope matches the sign of the correlation. So, for example, in the last three seasons where the slope has been negative, the corresponding correlations have also been negative. This means that, in those seasons, in games where the Home team has scored more than an average Home team in that season, the Away team has tended to score less than an average Away team in that season.
Positive slopes and hence positive correlations have been rare, with only 20 sightings recorded in VFL/AFL history, and none in the past two decades. The most recent season in which Home team and Away team scores were positively correlated, meaning that relatively high-scoring games for the Home team were also relatively high-scoring games for the Away team, was 1989.
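To make the link between slope and correlation concrete, here's a minimal Python sketch (not the code used to produce the plots above). It computes the Pearson correlation and the least-squares slope of Away scores on Home scores, which share the same numerator and so always share the same sign. The function name and the scores are made up purely for illustration.

```python
import math

def pearson_and_slope(home, away):
    """Return (Pearson correlation, least-squares slope of away on home)."""
    n = len(home)
    mean_h = sum(home) / n
    mean_a = sum(away) / n
    # Sums of squares and cross-products about the means
    sxx = sum((h - mean_h) ** 2 for h in home)
    syy = sum((a - mean_a) ** 2 for a in away)
    sxy = sum((h - mean_h) * (a - mean_a) for h, a in zip(home, away))
    r = sxy / math.sqrt(sxx * syy)   # correlation
    slope = sxy / sxx                # regression slope, same sign as r
    return r, slope

# Made-up scores for illustration only (not real results)
home = [110, 95, 130, 88, 102, 121]
away = [80, 99, 70, 105, 92, 76]

r, slope = pearson_and_slope(home, away)
print(f"correlation = {r:.2f}, slope = {slope:.2f}")
```

Because the slope equals the correlation multiplied by the ratio of the two standard deviations, a steeper fitted line goes hand in hand with a stronger correlation whenever Home and Away scores have a similar spread.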
The propagate R package was released in late 2013 and provides the ability to fit a variety of statistical distributions to univariate data. It ranks the fit of each distribution on a given data set on the basis of the AIC of the fit and returns, for each distribution, estimates of all parameter values. The documentation for this package cautions that "a decent number of observations should be at hand in order to obtain a realistic estimate of the proper distribution", which might not be the case for some of the early seasons, which comprised as few as 62 games. Still, any excuse to learn a new R package ...
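The idea of ranking candidate distributions by AIC can be sketched in a few lines of Python. This is an illustration only: it uses closed-form maximum likelihood fits for three of the distributions mentioned in this post, and the propagate package may well compute its fits and its AIC values differently. The scores are simulated, and all function names are my own.

```python
import math
import random
import statistics

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: smaller is better."""
    return 2 * n_params - 2 * log_likelihood

def normal_loglik(x):
    # ML fit: mean and variance (divisor n); log-likelihood at the optimum
    n = len(x)
    var = statistics.pvariance(x)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def lognormal_loglik(x):
    # ML fit of a Normal to the logged scores (scores must be positive)
    n = len(x)
    logs = [math.log(v) for v in x]
    var = statistics.pvariance(logs)
    return -sum(logs) - 0.5 * n * (math.log(2 * math.pi * var) + 1)

def laplace_loglik(x):
    # ML fit: location = median, scale = mean absolute deviation from it
    n = len(x)
    med = statistics.median(x)
    b = sum(abs(v - med) for v in x) / n
    return -n * (math.log(2 * b) + 1)

def rank_by_aic(x):
    """Return (distribution name, AIC) pairs, best fit first."""
    fits = {
        "Normal": aic(normal_loglik(x), 2),
        "Log Normal": aic(lognormal_loglik(x), 2),
        "Laplace": aic(laplace_loglik(x), 2),
    }
    return sorted(fits.items(), key=lambda kv: kv[1])

# Simulated "season" of 150 scores, for illustration only
random.seed(1)
scores = [max(1.0, random.gauss(95, 25)) for _ in range(150)]

for name, value in rank_by_aic(scores):
    print(f"{name:10s} AIC = {value:.1f}")
```

All three candidates here have two parameters, so the AIC ordering is just the log-likelihood ordering. The penalty term only matters when comparing distributions with different numbers of parameters.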
The following table summarises the performance of all 20 distribution types available in the fitDistr function, each distribution being fitted in turn to the data for each season.
Considering the entire history of VFL/AFL and looking firstly at modelling Home team scores we find that the Curvilinear Trapezoidal distribution (see GUM 2008) is the best-fitting distribution most often. It assumes that position in 21% of the seasons. Further, it finishes amongst the best three distributions in 37% of seasons, and in the top five in 56% of seasons.
The Curvilinear Trapezoidal distribution performs about equally well when fitted to Away team scores, finishing in the top five in about one-half of all seasons.
The Gamma, Normal, Trapezoidal and Laplace distributions each also finish amongst the three best-fitting models for Home team and for Away team scores in at least 25% of seasons. Other distributions of note are the Log Normal, Logistic and Gumbel distributions, none of which claims gold very often for either Home or Away teams but all of which finish in the top five for both team types about 40% of the time or more.
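The "best", "top three" and "top five" percentages quoted above are simple tallies across seasons. As a rough sketch of that arithmetic in Python, assuming we hold a best-to-worst ranking of distributions for each season (the rankings below are invented and cover only four distributions and four seasons):

```python
def top_k_share(rankings, name, k):
    """Share of seasons in which `name` is among the k best-fitting distributions."""
    hits = sum(1 for ranked in rankings.values() if name in ranked[:k])
    return hits / len(rankings)

# Hypothetical per-season rankings, best fit first (not the real results)
rankings = {
    1897: ["Gamma", "Normal", "Laplace", "Logistic"],
    1898: ["Normal", "Gamma", "Logistic", "Laplace"],
    1899: ["Laplace", "Logistic", "Gamma", "Normal"],
    1900: ["Gamma", "Laplace", "Normal", "Logistic"],
}

for dist in ["Gamma", "Normal"]:
    best = top_k_share(rankings, dist, 1)
    top3 = top_k_share(rankings, dist, 3)
    print(f"{dist}: best in {best:.0%} of seasons, top three in {top3:.0%}")
```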
Focussing our attention on just the last 50 years of history (during which no season has had fewer than 112 games so we might feel more confident about the appropriateness of using the fitDistr function), the performances of three distributions stand out: Gamma, Curvilinear Trapezoidal and Normal. Any of the three seems to be a reasonable choice for modelling the Home team or the Away team scores as univariate random variables in any season.
One other thing that's interesting to note about the results for the 1964 to 2013 period is that we see, as we did in the blog I linked to at the start of this post, that the Normal distribution tends to fit Away team scores better than it fits Home team scores.
We've shown that the marginal distributions of Home team and Away team scores can be best approximated by a Gamma, Curvilinear Trapezoidal or Normal distribution. Home team and Away team scores are not independent, however, as we saw in the charts in the first section, so if we want to jointly model Home and Away team scores for a game, we need to consider bivariate distributions.
Of the three distribution types just listed, only the Normal can be readily fitted in R as a bivariate using existing packages. The VGAM R package includes a version of a bivariate Gamma known as the McKay distribution, but it imposes conditions on the underlying variates that are not met by our Home and Away scoring data, and I can find no other package that supports the fitting of a bivariate Gamma. The Curvilinear Trapezoidal distribution seems to be a relatively new creation, so it's not surprising that no bivariate version of it can be fitted in R using an existing package either.
So, proceeding as before by treating each season on its own, we use the mlest function from the mvnmle package to fit 117 bivariate Normal distributions, the parameters of which are summarised in the following chart.
The points in each panel reflect the actual parameter values for the particular metric in a given season while the line is a loess fit to highlight the underlying local trend.
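For readers who'd like to see what the five fitted parameters per season amount to, here's a minimal Python sketch. It is not the mlest code: it relies on the fact that, with no missing scores, the maximum likelihood estimates for a bivariate Normal are just the sample means, the standard deviations computed with divisor n, and the sample correlation. The function names, dictionary keys and game results below are all made up for illustration.

```python
import math

def fit_bivariate_normal(home, away):
    """Closed-form ML estimates of the five bivariate Normal parameters."""
    n = len(home)
    mu_h = sum(home) / n
    mu_a = sum(away) / n
    # ML variances and covariance use divisor n (not n - 1)
    var_h = sum((h - mu_h) ** 2 for h in home) / n
    var_a = sum((a - mu_a) ** 2 for a in away) / n
    cov = sum((h - mu_h) * (a - mu_a) for h, a in zip(home, away)) / n
    return {
        "mean_home": mu_h,
        "sd_home": math.sqrt(var_h),
        "mean_away": mu_a,
        "sd_away": math.sqrt(var_a),
        "corr": cov / math.sqrt(var_h * var_a),
    }

def fit_by_season(games):
    """games: iterable of (season, home_score, away_score) tuples."""
    by_season = {}
    for season, home_score, away_score in games:
        home_scores, away_scores = by_season.setdefault(season, ([], []))
        home_scores.append(home_score)
        away_scores.append(away_score)
    # One separate fit per season, in chronological order
    return {
        season: fit_bivariate_normal(hs, aws)
        for season, (hs, aws) in sorted(by_season.items())
    }

# Invented results for illustration only
games = [
    (2012, 100, 80), (2012, 90, 95), (2012, 120, 70), (2012, 85, 88),
    (2013, 105, 75), (2013, 80, 100), (2013, 110, 60), (2013, 95, 90),
]

for season, params in fit_by_season(games).items():
    print(season, {k: round(v, 2) for k, v in params.items()})
```

Each season's fit is independent of every other season's, which is what allows the parameter values to drift over time in the way the panels show.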
In the first panel we can see a general rising trend in Home team scores up until about 1940, after which scores fell for about a decade, then stayed flat for a decade more before rising to a peak of about 120 points per game in the early 1980s and then sliding back to the current figure of just under 100 points per game.
The next panel records the variability of Home team scores about the season average and shows much higher levels of season-to-season variability than do average scores. The standard deviation in Home team scores was lowest around the 1920s when it was near two-and-a-half goals (15 points), and highest in the early to mid 1980s when it reached levels of about 6 goals (36 points).
Next is the panel for Away team score averages, which shows a very similar pattern to that of Home teams in terms of rises and falls (the correlation is +0.99), but generally tracks at a level about 8-10 points lower.
The standard deviation of Away team scores, shown in the fourth panel, also has a similar pattern to that for Home team scores (the correlation is +0.77). On average, however, the standard deviation of Away scores is about 2 points less than that of Home scores, though the difference is only about two-thirds of that if we consider just the last 50 seasons.
Finally, the fifth panel records the correlation between Home and Away scores. It shows highly negative correlations in the early seasons gradually becoming less negative until about 1920, then generally tracking within a narrow band between -0.25 and 0 for about 60 years with an occasional excursion outside that range. After that comes a general decline right up to and including the most-recent seasons, leaving the correlation for each of the last three years around the lowest point since the competition began.
Exactly what might be driving these more highly negative correlations in team scores is an interesting topic to ponder - thoughts and suggestions are welcomed.