Performance-Testing the In-Running Model Against 2017 to 2019 Data

In the previous blog, we created a quantile regression model that allowed us to estimate, in-running, a home team’s victory probability, and to create in-running confidence intervals for the home team’s final margin.
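One way a set of quantile forecasts can yield a win probability - a sketch only, since the model's internals aren't restated here - is to treat the predicted margin quantiles as points on the forecast distribution of the final margin and interpolate to find the probability that the margin exceeds zero. The function name, quantile levels, and margins below are all illustrative:

```python
import numpy as np

def home_win_probability(quantile_levels, margin_quantiles):
    """Estimate P(home margin > 0) from a set of predicted margin quantiles.

    quantile_levels  -- e.g. [0.05, 0.25, 0.50, 0.75, 0.95]
    margin_quantiles -- predicted final margins at those levels (sorted ascending)
    """
    levels = np.asarray(quantile_levels, dtype=float)
    margins = np.asarray(margin_quantiles, dtype=float)
    # The quantile predictions trace out the forecast CDF: F(margin) = level.
    # Interpolate to find F(0), the probability the home team loses or draws.
    cdf_at_zero = np.interp(0.0, margins, levels)
    return 1.0 - cdf_at_zero

# A hypothetical late-game state with the home team well in front:
p = home_win_probability([0.05, 0.25, 0.5, 0.75, 0.95], [-2, 8, 15, 22, 33])
```

Note that `np.interp` clamps to the outermost levels when zero lies beyond the predicted quantiles, so probabilities from this sketch never stray outside the range implied by the most extreme quantiles forecast.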

We evaluated that model based on a variety of performance metrics calculated using a 50% holdout sample from the original data set, which included games spanning the 2008 to 2016 period.

But nothing really measures a model’s performance better than a completely fresh data set from a non-overlapping time period, and in this blog we’ll be running the same metrics, but for games spanning the 2017 to 2019 period (up to and including the first week of the 2019 Finals). That’s 616 games entirely unseen by the model.

(The fact that you’re reading this blog probably gives you a clue that the story mostly ends well, but I have to admit that testing one of my models on truly fresh data - especially when a client has already paid for the version you’re testing - is one of the most exhilarating and terrifying aspects of this predictive modelling thing that I do for a living. You really can’t hide it when a model fails to generalise post-sample.)


Firstly, let’s look at the calibration of home team probability forecasts at different points in the game.
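For readers who'd like to run this kind of check on their own forecasts, calibration at a given game fraction can be assessed by binning the forecast probabilities and comparing each bin's average forecast with the observed home-team win rate. This is a minimal sketch - the function name and bin count are illustrative, not the code behind the charts:

```python
import numpy as np

def calibration_table(forecast_probs, home_won, n_bins=10):
    """Bin forecast home-win probabilities and compare each bin's average
    forecast with the observed home-win rate. Well-calibrated forecasts
    sit close to the diagonal when the two are plotted against each other."""
    probs = np.asarray(forecast_probs, dtype=float)
    outcomes = np.asarray(home_won, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Final bin is closed on the right so that a forecast of exactly 1.0 is counted.
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if in_bin.any():
            rows.append((probs[in_bin].mean(), outcomes[in_bin].mean(), int(in_bin.sum())))
    return rows  # (mean forecast, observed win rate, games) per bin
```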

On the left we have the latest results, based on 2017 to 2019 data, and on the right the results from the previous blog, which were based on the holdout data from 2008 to 2016.

The results are (mercifully) comparable. In fact, for a game fraction of 95%, you’d probably rather have the results on the new data set than those on the right. And you’d rather have neither result for a game fraction of 99%, but that might well be hard to fix without additional data - or at all - as we discussed in the previous blog.

Next, we’ll review the profile of mean log probability score across all game fractions, which is not so much a performance metric as an interesting characteristic of the model’s behaviour.
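As a sketch of one common form of this score - 1 plus the base-2 log of the probability assigned to the outcome that actually occurred, so that a coin-flip forecast scores exactly zero - computed for a set of games at a single game fraction (the exact scoring rule behind the charts isn't restated here, so treat this as illustrative):

```python
import math

def mean_log_prob_score(forecast_probs, home_won):
    """Mean log probability score, here taken as 1 + log2(p_actual), where
    p_actual is the probability the forecast assigned to the outcome that
    occurred. Higher is better; a 50/50 forecast scores exactly 0."""
    scores = []
    for p, won in zip(forecast_probs, home_won):
        p_actual = p if won else 1.0 - p
        scores.append(1.0 + math.log2(p_actual))
    return sum(scores) / len(scores)
```

Profiling the score at each game fraction then amounts to calling this on the forecasts made at that fraction.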

The profile is strikingly similar, though a lot smoother on the new data, and spans a slightly larger range of mean scores.

When we look at the confidence interval coverage of the final margin forecasts, we again find that the model fits the new data about as well as the old, and we still have the issues with coverage on this new data set that we detected in the previous blog, which kick in from about the midpoint of the final term onwards.
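Coverage itself is straightforward to compute from a pair of quantile forecasts and the actual results - again a sketch with illustrative names, not the code used for the charts:

```python
import numpy as np

def interval_coverage(lower_q, upper_q, actual_margins):
    """Empirical coverage of forecast intervals: the fraction of games whose
    actual final margin fell between the lower and upper quantile forecasts.
    For intervals built from the 10th and 90th percentile forecasts, ideal
    coverage is 80%; materially higher means the intervals are too wide."""
    lower = np.asarray(lower_q, dtype=float)
    upper = np.asarray(upper_q, dtype=float)
    actual = np.asarray(actual_margins, dtype=float)
    return float(np.mean((actual >= lower) & (actual <= upper)))
```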

There are also some small, but persistent, deviations from the ideal in these new results that might be worth further investigation - for example, the 90th and 75th percentile estimates seem to be a little high for most of the game, as do the 50th percentile estimates - but nothing large enough to be a huge concern.

Lastly, let’s review and compare the profile of the mean absolute error of the median quantile model.

In general terms, margin prediction over the past few years - and this year in particular - has been a little easier, and we see that reflected in this new chart. Our start-of-game, quarter-time, half-time, and three-quarter-time mean absolute errors are all down by about a point per game.
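As a sketch of how such a profile can be computed from median forecasts recorded at various game fractions (the function name and grid values are illustrative):

```python
import numpy as np

def mae_profile(median_forecasts, actual_margins, game_fractions, grid):
    """Mean absolute error of the median-quantile margin forecast at each
    game fraction in `grid` (e.g. 0.0 for the start of the game, 0.25 for
    quarter-time, 0.5 for half-time, 0.75 for three-quarter time)."""
    med = np.asarray(median_forecasts, dtype=float)
    act = np.asarray(actual_margins, dtype=float)
    frac = np.asarray(game_fractions, dtype=float)
    profile = {}
    for g in grid:
        at_g = np.isclose(frac, g)  # forecasts recorded at this game fraction
        if at_g.any():
            profile[g] = float(np.mean(np.abs(med[at_g] - act[at_g])))
    return profile
```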


So, the in-running model seems to have performed about as well on the 2017 to 2019 data as it did on the holdout data from 2008 to 2016.

That should give us some comfort in using the model for future games.

The fact that the coverage of the confidence intervals is too broad in some sections of the game - that, for example, an 80% confidence interval formed using the 10th and 90th percentile forecasts sometimes covers as much as 85 or 90% of the true distribution - raises the possibility that the different rates of scoring in recent seasons have reduced the range of final margins that are plausible given a current margin at some defined point in the game. One way of remedying that might be to include terms in the model that adjust according to the rate of scoring (or the rate of converting scoring shots) in the current game.
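As an illustrative sketch of that remedy - not the model's actual specification - a linear quantile regression fitted by subgradient descent on the pinball loss could carry a scoring-rate feature alongside the current margin and game fraction. Everything here (feature choice, learning rate, epoch count) is a hypothetical:

```python
import numpy as np

def fit_quantile(X, y, tau, lr=0.01, epochs=2000):
    """Fit a linear quantile regression at level `tau` by subgradient descent
    on the pinball loss. A hypothetical feature matrix X might hold the
    current margin, the game fraction, and a scoring-rate term such as
    points per minute so far in the current game."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        resid = y - Xb @ beta
        # Pinball-loss subgradient weights: tau where the residual is
        # positive (under-prediction), tau - 1 otherwise.
        grad_weights = np.where(resid > 0, tau, tau - 1.0)
        beta += lr * (Xb.T @ grad_weights) / len(y)
    return beta
```

Fitting one such model per quantile level, with the scoring-rate feature included, would let the interval widths shrink or stretch with how freely the current game is scoring.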

More to come.