Building and Performance-Testing an In-Running Model

I’ve created in-running models before, for the projected final total of a game in progress, as well as for the projected final margin and probability of victory.

For today’s blog I’m going to revisit that earlier model I built to project the final margin and estimate the home team’s probability of victory in-running, with a view to being clearer about how the model was built, and how we can assess its efficacy.


I’m again going to use a quantile regression (fitted with the quantreg R package) as the underlying model, with the actual final margin from the home team’s perspective as the target variable. In a typical linear regression, we’re aiming to estimate the mean of some target variable as a linear function of some input variables, but in a quantile regression we’re aiming to estimate a number of quantiles of the target variable as linear functions of some input variables.

So, for example, we might choose the median or 50th percentile as one of the quantiles to fit, and we might also choose the lower quartile or 25th percentile, and the upper quartile or 75th percentile. Or, we might, as I will, choose to fit all of the quantiles from the 1st percentile to the 99th percentile.

Fitting a number of quantiles allows us to create conditional confidence intervals for our target variable, rather than just estimating its conditional mean. We can therefore make statements such as: “given the inputs, there’s a 50% chance that the target variable lies between A and B”, where A and B have been determined using, say, the fitted quantile regression models for the 25th and 75th percentiles.
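As a sketch of the idea (not the blog’s actual model, which is fitted in R with quantreg), here’s a quantile regression fitted at three quantiles in Python via statsmodels, using simulated data with a single hypothetical predictor, “lead”:

```python
# A sketch of quantile regression and conditional confidence intervals,
# using simulated data and statsmodels (the blog itself uses R's quantreg).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"lead": rng.normal(0, 20, n)})
# Simulated final margin: the current lead plus noise
df["final_margin"] = df["lead"] + rng.normal(0, 15, n)

# One fitted model per quantile of interest
fits = {q: smf.quantreg("final_margin ~ lead", df).fit(q=q)
        for q in (0.25, 0.50, 0.75)}

# A 50% confidence interval for the final margin, given a 10-point lead
new = pd.DataFrame({"lead": [10.0]})
lo = fits[0.25].predict(new).iloc[0]
hi = fits[0.75].predict(new).iloc[0]
print(f"50% CI for final margin: ({lo:.1f}, {hi:.1f})")
```

The full model does the same thing at 99 quantile levels, and with six inputs rather than one.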


Now, since we’re creating an in-running model, we’re going to need a data set that includes the score progression across a large sample of games. For this purpose we’ll use the data from AFLtables for the period 2008 to 2016 (which, in its raw form, looks like this page for the Round 1 Geelong v Essendon game in 2010).

I’m hoping to scrape the data from 2017 to 2019 in the next few days, and will write a supplementary blog if I’m able to do so.

We could use the score progression data as it comes, but that would tend to overweight games with more scoring events and underweight those with fewer scoring events, since there’s one entry in the score progression data for every such event. I prefer, instead, to create 100 data points for each game, each recording the score at fixed 1% intervals of the game. So, for example, there will be one data point for the 25% mark of each game, which is the score at quarter-time.
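The conversion from event-level score progression to fixed-interval data points can be sketched as follows; the event times and margins here are invented for illustration, and the key point is that the margin is a step function that holds its last value between scoring events:

```python
# A minimal sketch of converting an event-level score progression into
# fixed 1% interval data points for one game. Event data are invented.
import numpy as np

# (game_fraction, home_margin) at each scoring event, plus start and end
event_fraction = np.array([0.00, 0.08, 0.21, 0.25, 0.47, 0.80, 1.00])
event_margin   = np.array([0,    6,    5,    11,   4,    -2,   3])

# The margin is a step function: between events it holds its last value
grid = np.linspace(0.01, 0.99, 99)  # game fractions 1% to 99%
idx = np.searchsorted(event_fraction, grid, side="right") - 1
margin_at_grid = event_margin[idx]

print(margin_at_grid[49])  # home margin at the 50% mark (half time)
```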

We’ll supplement this dynamic score data with static data reflecting our pre-game opinions about the likely final margin, encapsulated both in the bookmaker’s pre-game handicap in the line betting market, and in our very own MoSHBODS pre-game expected margin. We’ll also use MoSHBODS’ views on the likely final home team and away team scores.

Before we do any transformations or “feature sculpting”, the raw data we’re working with for each game is then:

  • Current Score at 1% increments

  • Game Fraction (running from 1% to 99%)

  • Bookmaker’s Pre-Game Estimated Final Margin

  • MoSHBODS’ Pre-Game Estimated Final Margin

  • MoSHBODS’ Pre-Game Estimated Final Home Team Score

  • MoSHBODS’ Pre-Game Estimated Final Away Team Score


Each of the 99 quantile models is a linear regression, and we’ll be using one with six input terms, each a transformation of the raw data elements just listed.

Each quantile model looks like this:

The heart of the in-running model is the transforms we apply in building our terms for the regression, which need to reflect the passage of time, as recorded in the Game Fraction variable.

I don’t know of any objective, systematic way to come up with the “best” transforms, so I arrived at mine through a combination of common-sense, trial and error, and optimisation.

To be clear, what I did was to split the original data set, which included 178,600 data points (100 data points for 1,786 games) 50:50 into a training and a testing sample and then seek to maximise the log probability score of the fitted value averaged across the entire training sample.

The terms I ended up with are shown at right, the broad common sense of which is that:

  • Terms that reflect the current state (1 and 2) become more important as the game progresses

  • Terms that reflect pre-game assessments (3 and 4) become less important as the game progresses

  • Terms that reflect both the current state and pre-game assessments (5 and 6) are weighted more towards the current state, and become more important, as the game progresses

Recall that the underlying model is just a series of linear regressions, each of the form of the first equation shown above.

In the table at right I’ve provided the fitted coefficients for 9 of the 99 quantile models.

Those in the column headed 50th are for the 50th percentile (ie the median), and provide the constants by which we multiply the relevant Terms to create an estimated median for the Final Margin based on the current state of the game.


It’s fairly straightforward, I think, to see how the model can be used to create in-running confidence intervals for the final margin.

For example, if we wanted a 90% confidence interval for the final margin at some point in the game we’d just use the fitted models for the 5th and 95th percentiles, calculate their estimates, and use the range between them as our 90% confidence interval.

But, we can also use the outputs from the 99 different fitted models to estimate the home team’s victory probability by looking at which quantile models provide estimated final margins closest to zero. If, say, the 5th percentile model gives an estimated final margin of -0.2 and the 6th gives an estimated final margin of +0.4, to a first-order approximation, the home team’s victory probability is about 95%.

We can get a little more precise in two ways:

  • Fitting more regression quantiles (say 999 rather than the 99 I have)

  • Interpolating between the outputs of the 99 percentile models we have

I’ve opted for the latter, lazier option (using the approx function from the stats package in R for interpolation) mainly because of the large overhead in fitting quantile regressions to a large number of quantiles.
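The interpolation step can be sketched like this: given the 99 models’ estimated final margins at some point in a game, find the quantile level at which the estimate crosses zero. The blog uses R’s stats::approx; numpy’s interp plays the same role here, and the margins below are invented (and, for simplicity, draws are ignored):

```python
# A sketch of interpolating the 99 quantile outputs to get a home-team
# win probability. The estimated margins are hypothetical placeholders.
import numpy as np

quantiles = np.linspace(0.01, 0.99, 99)      # 1st to 99th percentile
# Hypothetical estimated final margins, increasing across quantiles
est_margins = np.linspace(-20.0, 45.0, 99)

# Interpolate the quantile level at a margin of zero: that level is
# (approximately) P(final margin <= 0), i.e. a home team loss
p_loss = np.interp(0.0, est_margins, quantiles)
p_home_win = 1.0 - p_loss
print(round(p_home_win, 3))
```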


Coming up with useful performance metrics for a model requires us to be precise about what we want that model to be good at.

One obvious requirement here is that our model’s home team probability estimates be well-calibrated - that is, that home teams estimated to have an X% chance of winning at some point in the game actually do go on to win about X% of the time.

To measure this, we gather and bin all of the home team probability assessments for the same game fraction across all of the games in our test set, and then calculate the home team win rate for the games in each bin.

Again, being specific, we create a first bin that includes all games where the estimated home team probability was less than 10% at some designated point in the game and calculate how often the home team won games in that bin. We then do the same for games where the probability estimate was between 10% and 20%, 20% and 30%, and so on.
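The binning procedure just described can be sketched as follows. The probabilities and outcomes here are simulated so that the “model” is well-calibrated by construction, which is what the win rate in each bin then confirms:

```python
# A minimal sketch of the calibration check at one game fraction:
# bin the estimated home-win probabilities, then compare each bin's
# range with the observed home win rate. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
p_est = rng.uniform(0, 1, 20000)              # estimated home-win probs
home_won = rng.uniform(0, 1, 20000) < p_est   # simulated outcomes

bins = np.arange(0.0, 1.01, 0.1)              # 0-10%, 10-20%, ...
which = np.digitize(p_est, bins) - 1
for b in range(10):
    in_bin = which == b
    print(f"{bins[b]:.0%}-{bins[b + 1]:.0%}: "
          f"win rate {home_won[in_bin].mean():.2f}")
```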

The chart below records the results of doing this for 9 selected game fractions (or points in the game) - after 5% of the game, 25% of the game (ie quarter time), and so on.

If our model is well-calibrated for a particular game fraction, the solid line will tend to track the dotted line, because the proportion of home teams winning will be roughly equal to the probability assigned to them.

We see that the calibration is exceptionally good until we have 5% or less of the game remaining (that equates to roughly the last 60 seconds of actual playing time).

It’s not entirely surprising that we might need to think about doing something different to model the dying stages of games as it’s clear that teams often play differently in this portion if they are, say, defending a small lead, or if the game is a blowout. For these games, a complete model would need to account for where the ball is on the field, which team has possession of it, and maybe even prevailing weather conditions and the teams’ positions on the competition ladder.

All things considered, to create a single model that’s well-calibrated from the first minute of the game up to about the last is a good result.

(Bear in mind, too, that there are likely to be relatively few games where the estimated probability is around 50% with just 1 minute to go, so the points for probability estimates of 40%, 50% and 60% are likely to be based on relatively small samples.)

Another probability-related metric that’s interesting to look at is how the model’s log probability score varies, on average, across the course of a game. We would expect that it would be relatively low to start with while the outcome is (usually) most uncertain and rise as the game progresses and the outcome (usually) becomes more certain.

Consider, for example, a game where, initially, the home team is assessed as a 60% chance of winning. On the assumption that our model is well-calibrated, our expected log probability score at that point in the game is given by 0.6 * (1+log(0.6,2)) + 0.4 * (1+log(1-0.6,2)), which is about 0.03.

With 10 minutes to go, the home team leads by 16 points and our model estimates their probability of winning to be 90%. Our expected log probability score at that point in the game is given by 0.9 * (1+log(0.9,2)) + 0.1 * (1+log(1-0.9,2)), which is about 0.53.
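The arithmetic from these two examples generalises to a small function. The score uses base-2 logs, with 1 added so that a coin-flip forecast for the eventual winner scores zero:

```python
# Expected log probability score for a well-calibrated forecast of
# probability p, matching the two worked examples in the text.
import math

def expected_lps(p):
    """Expected score when the true win probability is also p."""
    return p * (1 + math.log2(p)) + (1 - p) * (1 + math.log2(1 - p))

print(round(expected_lps(0.6), 2))  # about 0.03
print(round(expected_lps(0.9), 2))  # about 0.53
```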

The chart below reveals that the mean log probability score for our model does, indeed, rise as we look at points later and later in the game. The fact that it starts close to 0.2 and gets to 0.6 by three-quarter time reveals that a large number of games have relatively little uncertainty much earlier than we (well, certainly, I) might have expected. A mean of 0.2 reflects a probability around 75%, and one of 0.6 reflects a probability around 92%.

Since we’ve gone to the trouble of fitting quantiles, we should also care about how well-calibrated they are (or, to be more technically correct, about whether they provide appropriate coverage). So, for example, we want the actual final margin to fall below our model’s estimate for the 5th percentile only about 5% of the time, and likewise for the other percentiles.

To test this we’ll gather the estimates for 5 of the quantile models at each of the 99 different points in the game and calculate how often, on average, the actual final margin fell below their estimates.
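A sketch of this coverage check: for a well-calibrated model of the q-th percentile, the actual final margin should fall below the estimate about q% of the time. The final margins and estimates below are simulated from a normal distribution so that calibration holds by construction:

```python
# A sketch of the coverage calculation for a few quantile models,
# using simulated final margins (mean 10, sd 30) and estimates taken
# from the same distribution, so coverage should match each level.
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(2)
actual = rng.normal(10, 30, 50000)   # simulated actual final margins

for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    estimate = NormalDist(10, 30).inv_cdf(q)   # well-calibrated estimate
    coverage = (actual < estimate).mean()
    print(f"{q:.0%} quantile: actual below estimate "
          f"{coverage:.1%} of games")
```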

A particular quantile model is well-calibrated across all parts of the game if the line charting its performance is roughly a straight line with a value equal to the percentile it’s for.

So, for example, the top line in the chart above, which is for the model of the 90th percentile, should, ideally, be a straight line with a value of 90% across all game fraction values. We can see that it mostly is, except for the latter portions of the game, which is true to varying extents for all of the lines shown here (which are for the 75th, 50th, 25th, and 10th percentiles).

For the reasons outlined earlier, we can probably explain the deviations for the last 5% or so of the game and consider them close to unmodellable without additional data, but there do seem to be some signs of poorer calibration a little earlier in the final term, though the deviations are mostly around 5 percentage points or less.

Nonetheless, this chart does suggest that some additional work might need to be done to the model to improve the coverage of final margin confidence interval forecasts for the second halves of final terms. In the meantime, a mental adjustment to them is appropriate.

One final metric that might be of interest is the mean absolute forecast error of the median quantile model across the game. In other words, on average and as the game progresses, how different is the final margin from the forecast provided by the median quantile model?

We see that, at the start of a game, the mean absolute error of the median model forecast is just under 30 points per game. That falls to about 20 points per game by half time, and 14 points per game by three-quarter time.
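The metric itself is simple to compute given a matrix of median forecasts, one row per game and one column per game fraction. The arrays below are invented placeholders whose forecast noise shrinks as the game progresses, mimicking the pattern described:

```python
# Sketch of the mean-absolute-error metric for the median model at each
# game fraction: average |actual final margin - median forecast| over
# games. Arrays are simulated placeholders shaped (games, 99 fractions).
import numpy as np

rng = np.random.default_rng(3)
n_games, n_fracs = 500, 99
actual = rng.normal(0, 30, n_games)                    # final margins
# Forecast noise shrinks from early in the game to late
spread = np.linspace(30, 5, n_fracs)
median_fc = actual[:, None] + rng.normal(0, 1, (n_games, n_fracs)) * spread

mae = np.abs(actual[:, None] - median_fc).mean(axis=0)  # per fraction
print(round(mae[0], 1), round(mae[-1], 1))              # early vs late
```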


As the last piece for this blog, let’s look at the model’s performance, in-running, for an actual game from the sample.

The top chart tracks the final margin forecasts of 5 of the quantiles against the in-running, actual margin, shown in bright green. The pre-game estimate is that from the bookmaker’s line market handicap.

The bottom chart tracks the in-running home team probability estimate. The pre-game estimate is that from the bookmaker’s prices in the head-to-head market, and is calculated as Away Team Price / (Home Team Price + Away Team Price).
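The price-to-probability formula just given is worth a one-liner; the prices used here are hypothetical:

```python
# Home-team win probability implied by head-to-head prices, as described:
# Away Team Price / (Home Team Price + Away Team Price). This implicitly
# removes the bookmaker's overround. Example prices are hypothetical.
def home_prob(home_price, away_price):
    return away_price / (home_price + away_price)

print(round(home_prob(1.60, 2.35), 3))  # → 0.595
```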


It’s clear that this model provides well-calibrated probability estimates for, at least, all but the last minute or so of a game, and provides confidence intervals with acceptable coverage for, at least, all but the last five or ten minutes of a game (longer if we’re only interested in the median forecast).

It might be possible to improve the coverage in the latter parts of games by including additional terms in the model, and this is something I plan to investigate.

I’m also keen to see how well this model will do on games from 2017 to 2019, none of which have been used in its construction.