Reoptimisation and the Fear of Overfitting: ChiPS 2016

Richard McElreath, in one of the lectures from his Statistical Rethinking course on YouTube, aptly and amusingly notes (and I'm paraphrasing) that models are prone to get excited by exposure to data, and that one of our jobs as statistical modellers is to ensure this excitability doesn't lead to problems such as overfitting. We want to model the underlying relationships in our data, not the random noise that nature throws in to perplex Determinists.

With another full season's worth of data now available, we need to decide how excited we're going to allow the MoS models to become. In an earlier blog over on the Wagers and Tips section I've already discussed this topic and pointed out that, for the most part, I'd chosen not to re-estimate any of the algorithms for which re-estimation was either possible or necessary. To that extent, then, I've not merely dampened the models' excitability but prevented it.

The one exception to this no-tweaking rule was the ChiPS Rating System, which I there suggested would be allowed to update only its Home Ground Advantage parameters. On reflection, I've decided to be a little more permissive about ChiPS' excitability and allowed it to review all of its tuning parameters. The wisdom of this concession will, as ever, only be assessable in hindsight, but it seemed silly to re-optimise only a subset of the parameters when, in reality, the optimal values of all are interdependent.

At right is a table showing the new parameter values emerging from that re-optimisation alongside the equivalent values from the version of ChiPS used in 2015. (The genesis of the ChiPS system and the very first round of optimisation is described in this post from 2014.) The new parameters were derived by minimising the Mean Absolute Error (MAE) they produced when used to update Team Ratings for every game across the period 2006 to 2015. The final parameter, the Divisor, was optimised separately to maximise the Log Probability Score (LPS) of the System's probability estimates, with all other parameters in the System fixed at the optima just described.
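For concreteness, here's a highly simplified Python sketch of that two-stage procedure. Everything in it is a stand-in: the dummy game data, the placeholder predicted_margins() function (which in the real System would come from replaying Team Ratings across 2006 to 2015 with the candidate parameters), the assumed logistic conversion of margins to probabilities, and the 1 + log2(p) form of the per-game LPS are all my assumptions, included only to illustrate the shape of the process - tune the non-Divisor parameters against MAE, then tune the Divisor against LPS with everything else held fixed.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Dummy game-level data standing in for the 2006-2015 results
rng = np.random.default_rng(0)
actual_margins = rng.normal(10, 38, size=2000)   # home-team margins
home_wins = (actual_margins > 0).astype(float)

def predicted_margins(params):
    """Placeholder: the real System would replay Team Ratings across
    every game in the optimisation window using these parameters."""
    return np.full_like(actual_margins, params[0])

def mae(params):
    """Mean Absolute Error of the predicted margins."""
    return np.mean(np.abs(actual_margins - predicted_margins(params)))

def negative_lps(divisor, fixed_params):
    """Negative mean Log Probability Score, with probabilities obtained
    from predicted margins via an assumed logistic transform."""
    p = 1 / (1 + np.exp(-predicted_margins(fixed_params) / divisor))
    lps = home_wins * (1 + np.log2(p)) + (1 - home_wins) * (1 + np.log2(1 - p))
    return -np.mean(lps)

# Stage 1: choose all non-Divisor parameters to minimise MAE
stage1 = minimize(mae, x0=[0.0], method="Nelder-Mead")

# Stage 2: with those fixed, choose the Divisor to maximise LPS
stage2 = minimize_scalar(negative_lps, bounds=(5, 60), method="bounded",
                         args=(stage1.x,))

print(stage1.x, stage2.x)
```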

Optimising across 10 years of history rather than, as I did for the previous version of ChiPS, only five, was a choice made to dampen the model's excitability in relation to the most recent seasons' results, especially those of 2015. A rough analysis showed that team performances at home venues were quite different in 2015 compared to earlier seasons, and the effects of this variability are considerably reduced by the choice of a 10-year rather than a 5-year optimisation window. In the end, then, I'm hoping that the effects of including the 2006 to 2010 seasons will be moderating and beneficial rather than constraining and performance-impairing.

The changes in parameters with the largest influence are those to the update multipliers, because these affect the size of Rating changes for every contest in the relevant portion of the season. What we can deduce from the new multipliers is that the results in the early rounds of 2016 will tend to produce larger changes to Team Ratings under the new parameter regime than they would have under the previous one. That's no bad thing given the challenges that Essendon are expected to face this year - the sooner ChiPS is able to form an accurate estimate of their underlying ability, the better.

Actually, making accurate assessments of the abilities of all teams based on their early-season results will be important for ChiPS this year because the three update multipliers for the other parts of the home-and-away season are smaller this year than they were last year - considerably so in the case of the multiplier for Rounds 12 to 17, which has fallen by over 15%. The results in Finals will, however, have a slightly larger potential impact on Team Ratings, but that will likely mean more for next season than for this one.
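For readers who'd like a feel for how multipliers of this kind operate, below is a minimal Python sketch. It assumes an Elo-style update in which Ratings move in proportion to the gap between the actual and expected margins; the multiplier values and most of the round boundaries are invented for illustration and are not the re-optimised ChiPS parameters shown in the table above.

```python
def update_multiplier(round_number, is_final=False):
    """Return the Rating update multiplier for a game.
    The values and most of the round boundaries are invented; only the
    existence of an early-season block, three later home-and-away blocks
    (one of them Rounds 12 to 17), and a separate Finals multiplier
    reflects the discussion above."""
    if is_final:
        return 0.09
    if round_number <= 5:
        return 0.11          # early rounds: larger updates in the new regime
    elif round_number <= 11:
        return 0.08
    elif round_number <= 17:
        return 0.07          # the block whose multiplier fell by over 15%
    else:
        return 0.06


def update_ratings(home_rating, away_rating, actual_margin,
                   expected_margin, round_number, is_final=False):
    """Elo-style update (an assumption about ChiPS' general form): move
    both Ratings by a fraction of the difference between the actual and
    expected home-team margins."""
    k = update_multiplier(round_number, is_final)
    surprise = actual_margin - expected_margin
    return home_rating + k * surprise, away_rating - k * surprise


# Example: a home team rated 1,010 beats expectations by 20 points in Round 3
new_home, new_away = update_ratings(1010.0, 995.0, actual_margin=35.0,
                                    expected_margin=15.0, round_number=3)
print(new_home, new_away)   # 1012.2 992.8
```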

Looking next at what I've labelled the Home Team Net Venue Performance Adjustments, but which might otherwise be thought of as Home Ground Advantage (HGA) estimates, we see a mix of small and large parameter changes. Only seven of the HGAs increased in the re-optimisation (including the HGA for the Other Team/Venue Combinations), while 20 of them fell. Ten of them are now negative - up from eight in the previous version - while 17 are positive. This is, I'd suggest, more broad evidence of the decline in the advantages afforded by playing at home.

The most negative HGAs are Collingwood's negative 4-goal HGA at Docklands, Richmond's negative 20-point HGA at the MCG, and Sydney's negative 11-point HGA at Sydney Stadium. In the context of ChiPS, these values shouldn't be thought of as measures of how much more poorly these teams perform at these venues in a gross sense, but instead of how much more poorly they perform relative to expectations in a net sense, taking into account their opponents' performances. It might, for example, be the case that Richmond performs better than its own average at the MCG, but that its opponents there tend to exceed their own average performances by even more. Regardless of how we choose to interpret them, these large negative values proved impossible to reduce or eliminate in the re-optimisation process.

Amongst the remaining few parameters we see that the season-to-season Rating carryover, while still high, has reduced; that a margin cap is still not required (a cap of 190 points is no cap at all); that a team's excess form - which is the change in its Rating less the change in its opponent's Rating over the past two rounds - is marginally less important; and that the benefit of having your opponent fly interstate to play you has increased by about one-quarter of a goal. With three of the Finalists and five of the top 11 teams being non-Victorian in 2015, this latter increase was perhaps to be expected. Including the 2006 and 2007 seasons in the optimisation window would, in all likelihood, also have contributed to the increase, as interstate teams did well in both of those seasons.
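To show how the various components just mentioned might enter a single expected-margin calculation, here's an illustrative sketch. The additive form, the parameter values and the function name are all my assumptions; only the ingredients - the Rating gap, the home team's net venue performance adjustment, an interstate travel benefit, and excess form - come from the discussion above.

```python
def expected_home_margin(home_rating, away_rating, venue_adjustment,
                         away_travelled_interstate, home_excess_form,
                         away_excess_form,
                         travel_penalty=8.0, form_weight=0.1):
    """Illustrative (not actual ChiPS) expected margin for the home team:
    the Rating gap, plus the home team's net venue performance adjustment,
    plus a bonus if the away team flew interstate, plus a small nudge for
    the difference in recent 'excess form'."""
    margin = home_rating - away_rating
    margin += venue_adjustment
    if away_travelled_interstate:
        margin += travel_penalty
    margin += form_weight * (home_excess_form - away_excess_form)
    return margin


# Example: a 1,015-rated home team hosts a 1,000-rated interstate visitor
# at a venue where its net venue performance adjustment is -5 points.
print(expected_home_margin(1015, 1000, venue_adjustment=-5,
                           away_travelled_interstate=True,
                           home_excess_form=4, away_excess_form=-2))
# 15 - 5 + 8 + 0.6 = 18.6
```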

Acknowledging the optimistic picture that reporting in-sample performance metrics inevitably paints, I've nonetheless calculated the season-by-season MAEs and LPSs and presented them in the table at left. If nothing else it makes me feel as though I've achieved something, and the results are, as we would hope, very good - not least for 2015, where the MAE comes in at 28.5 points per game, more than 1 point per game better than ChiPS as it operated last season.

In the absence of a random holdout sample - which is challenging to construct in an inherently temporally correlated dataset - the potential for having overfit the data remains very much in play, of course.

Whilst the overall performance of the new ChiPS is impressive, its results on a team-by-team basis vary considerably. GWS has proven to be the team whose results have been most difficult to predict. When GWS has been playing at home, ChiPS has been in error about the final margin by, on average and in absolute terms, almost six goals. It's done just 1 point better when GWS has been playing away and has, therefore, across the entire period of GWS' history, been in error by 34.5 points per game. That's over 4 points per game worse than for the team that has next-most troubled the new ChiPS, Hawthorn, for whom ChiPS' MAE is 30.3 points per game.

For no other team does the new ChiPS have an MAE above 30 points per game, while for four teams (the Eagles, Roos, Lions and Swans), the MAE is below 28 points per game. Sydney's home results have been especially predictable (26.5 MAE) as have West Coast's away results (26.8 MAE).
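For anyone curious, splits of this kind drop straight out of a game-level results table. Here's a hypothetical pandas sketch; the column names and dummy numbers are mine, not ChiPS'.

```python
import pandas as pd

# Dummy game-level data: one row per game with the predicted and actual
# home-team margins and the two teams involved (values are illustrative).
games = pd.DataFrame({
    "home_team":        ["GWS", "Sydney", "GWS", "West Coast"],
    "away_team":        ["Sydney", "GWS", "Hawthorn", "GWS"],
    "predicted_margin": [-20.0, 35.0, -40.0, 30.0],
    "actual_margin":    [-55.0, 60.0, -10.0, 72.0],
})
games["abs_error"] = (games["actual_margin"] - games["predicted_margin"]).abs()

# Reshape so each game contributes one row per participating team,
# flagged as a home or an away appearance for that team.
home = games.assign(team=games["home_team"], venue_status="home")
away = games.assign(team=games["away_team"], venue_status="away")
by_team = pd.concat([home, away], ignore_index=True)

# MAE by team and by home/away status, plus an overall per-team figure
print(by_team.groupby(["team", "venue_status"])["abs_error"].mean())
print(by_team.groupby("team")["abs_error"].mean())
```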

To finish, let's have a look at the Team Ratings that the 18 teams will take into Round 1 of season 2016. 

Ten teams will start with a Rating of over 1,000, but only two - last year's Grand Finalists - will start with a Rating of over 1,020. At the other end of the Rating ladder, three teams will face their Round 1 opponents with a Rating under 980, and three more, including Essendon, will start with a Rating under 990.

One way of quantifying the degree of competitiveness implied by the initial Team Ratings is to look at their spread around the mean Rating of 1,000. For 2016, the standard deviation of Team Ratings in Round 1 is 16.1 points per team, which is up a little on the 15.6 points per team figure for 2015, but down considerably on the figures for the three years prior, which were 18.7 points per team (2014), 19.2 points per team (2013), and 17.5 points per team (2012, including a generous 1,000 Rating for GWS in their inaugural season).
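That spread measure is just the root-mean-square deviation of the Ratings around 1,000, which coincides with their standard deviation whenever the Ratings average exactly 1,000. Here's a quick sketch using made-up Ratings, since the actual Round 1 values aren't reproduced in this snippet.

```python
import numpy as np

# Made-up Round 1 Ratings for 18 teams, constructed to average exactly
# 1,000 - these are illustrative only, not the actual ChiPS Ratings.
ratings = np.array([1026, 1022, 1016, 1012, 1009, 1006, 1004, 1003, 1002, 1001,
                    1000, 999, 989, 988, 987, 979, 979, 978], dtype=float)

# Spread around the mean Rating of 1,000 (equivalently, the standard
# deviation, since the mean here is exactly 1,000)
spread = np.sqrt(np.mean((ratings - 1000.0) ** 2))
print(round(spread, 1), "points per team")
```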

A more visual way of deriving a similar insight is to inspect the stripe chart of the Ratings for each of the past 11 seasons.

The similarity between 2015 and 2016 is immediately apparent, as is the relatively wider spread of Ratings for 2012, 2013 and 2014.

So, whilst 2015 and 2016 don't appear to have returned us to the levels of competitiveness that seemed in prospect for, say, the 2006, 2007 and 2010 seasons, we've at least left behind the spreads we saw in the 2012-2014 period. 

Let's hope that the new ChiPS has got at least that assessment correct.