Estimating Game Margin Variability: Empirical Challenges

Lately I've been thinking a lot and writing a little - a mix that experience has taught me is nearer optimal - about the variability of game margins around their expected values.

Though the topic of game margin variability is interesting for its own sake, my continued exploration of it is motivated by more than just intellectual curiosity. Understanding the nature of the variability of game margins has implications for how accurate we might reasonably expect our own game margin predictions to be, and for how those expected game margins should best be converted into estimates of victory probability given that game margins are assumed to come from some statistical distribution (for example, a Normal distribution). Put simply, in a game where the final margin is expected to be more variable, a given superiority, measured in terms of points, is less likely to result in victory. More crisply: underdogs love variability, favourites don't.

Now in the past, I've assumed that the variability of game margins was constant across all games regardless of their defining characteristics, but more recently, based mostly on a theoretical model of scoring, I've begun to wonder if some games' margins might inherently be more variable than others'.

Specifically, if my scoring model is a reasonable approximation of how actual scoring relates to pre-game expectations, then we should see higher game margin variability in games that are expected to be higher scoring affairs. So, for example, we'd expect to see higher variability in the final game margin about the expected 30-point margin for a contest expected to finish 110-80 than we would for one expected to finish 90-60.

To briefly recap, the scoring model I've been using assumes that:

  • the Scoring Shots of the home and away teams are distributed as a Bivariate Negative Binomial
  • the conversion of these Shots into goals follows a univariate Beta Binomial distribution for each team

Values for the various parameters of these distributions can be estimated from historical data (as described in this blog), and these parameterised distributions can then be used to simulate games under different assumptions about the number of Scoring Shots each team is expected to generate.
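As a rough illustration of that simulation step, here's a minimal sketch. Two simplifications and assumptions to flag: the teams' Scoring Shots are drawn from independent (rather than bivariate) Negative Binomials, and the dispersion parameters `nb_size` and `bb_scale` are illustrative values I've chosen, not parameters fitted to historical data.

```python
import numpy as np

def simulate_margins(exp_shots_home, exp_shots_away, conv_rate=0.53,
                     nb_size=60, bb_scale=50, n_games=10_000, seed=0):
    """Simulate final margins (Home score minus Away score).

    Simplifications: the two teams' Scoring Shots come from independent
    (not bivariate) Negative Binomials, and nb_size / bb_scale are
    illustrative dispersion values, not fitted parameters.
    """
    rng = np.random.default_rng(seed)

    def team_scores(exp_shots):
        # Negative Binomial with mean exp_shots and size parameter nb_size
        p = nb_size / (nb_size + exp_shots)
        shots = rng.negative_binomial(nb_size, p, size=n_games)
        # Beta Binomial conversion: draw a game-level conversion rate from
        # a Beta, then draw the number of goals as a Binomial at that rate
        a = conv_rate * bb_scale
        b = (1 - conv_rate) * bb_scale
        goals = rng.binomial(shots, rng.beta(a, b, size=n_games))
        behinds = shots - goals
        return 6 * goals + behinds  # 6 points per goal, 1 per behind

    return team_scores(exp_shots_home) - team_scores(exp_shots_away)
```

With 20 expected Scoring Shots apiece, `simulate_margins(20, 20)` produces margins centred near zero with a standard deviation in the low 30s of points, broadly in line with the figures discussed later in this post.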

In the previous blog I noted that, with sufficient historical bookmaker data, the link between game margin variability and scoring expectations might be explored empirically and more directly. If we had details of both the pre-game line and the under/over market for each game, we could investigate directly the relationship between the margin error (the difference between the final margin and the expected margin, the expected margin being the negative of the line handicap) and the bookmaker's total game score expectation (the aggregate points value in the under/over market, set so that the bookmaker considers it equally likely that the teams' combined final score will fall above or below it).

Assuming we had enough games with the same (or similar) under/over value, we could calculate the variability of the margin errors for these games and then, considering all the under/over values for which sufficient data was available, estimate the relationship between this variability and the total score expectations.
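In code, that binning exercise might look something like the sketch below. Since I don't have the bookmaker data, the sketch runs on synthetic stand-in data in which the true margin-error standard deviation is constructed to grow with the under/over total; the column names and the linear SD-total link are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for historical bookmaker data (illustrative only):
# the true margin-error SD is constructed to grow with the under/over total
n = 20_000
total = rng.uniform(120, 220, n)         # under/over totals
line = rng.normal(0, 20, n)              # line handicaps
true_sd = 0.15 * total + 10              # assumed SD-total relationship
margin = -line + rng.normal(0, true_sd)  # expected margin = -line

games = pd.DataFrame({"total": total, "line": line, "margin": margin})
games["margin_error"] = games["margin"] + games["line"]

# Bin games with similar under/over totals, then estimate the SD of
# margin errors within each bin, keeping only bins with enough games
bin_mid = (games["total"] // 20) * 20 + 10
summary = (games.groupby(bin_mid)["margin_error"]
                .agg(["count", "std"])
                .query("count >= 500"))
print(summary)
```

With samples this large per bin, the estimated standard deviations recover the built-in upward trend; the practical problem, as discussed next, is that real data rarely provides anything like this many games per bin.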

It turns out, though, that the requirement for "enough games" is startlingly onerous, because the sample size required to estimate a variance with a given level of precision is considerably greater than that required to estimate, say, a mean with the same level of precision. In our case, assuming that game margins are approximately Normally distributed, we'd need, roughly speaking, about 50 times (i.e. 35 times the square root of 2) as much data to estimate the standard deviation of game margins for a particular type of game with a given level of precision as we'd need to estimate the mean game margin for that same game type.
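To get a feel for just how noisy a standard deviation estimated from 50 games is, here's a small simulation. It assumes, for illustration, Normally distributed margin errors with a true standard deviation of 33 points, which is roughly the level that appears in the simulations later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true SD of margin errors, roughly the level seen in the
# simulations discussed in this post
sigma, n_games, reps = 33.0, 50, 10_000

# 10,000 samples of 50 games each, and the sample SD of every sample
samples = rng.normal(0, sigma, size=(reps, n_games))
sd_ests = samples.std(axis=1, ddof=1)

lo, mid, hi = np.percentile(sd_ests, [2.5, 50, 97.5])
print(f"sample SDs from 50 games: 2.5th pct {lo:.1f}, "
      f"median {mid:.1f}, 97.5th pct {hi:.1f}")
```

Even though the true value is fixed at 33 points, individual 50-game samples routinely return standard deviations several points away from it, in either direction.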

Some examples might help to tease out some of the implications of this.

Four game-type example

Imagine that there are only four types of game, defined by the expected number of Scoring Shots for the Home and Away teams:

  • Type A: Home Team Expected Scoring Shots = 20; Away Team Expected Scoring Shots = 20
  • Type B: {23, 20}
  • Type C: {25, 20}
  • Type D: {28, 20}

Assuming a 53% Conversion Rate for both teams (so that each Scoring Shot is worth 0.53 × 6 + 0.47 × 1 = 3.65 points on average), Type A games are expected to generate about 146 points, Type B about 157 points, Type C about 164 points, and Type D about 175 points. Type A games are also expected to finish in a draw, Type B with an 11-point Home win, Type C with an 18-point Home win, and Type D with a 29-point Home win.

Now assume further that we have a sample of 50 of each game Type that we'll use to calculate the standard deviation of game margin errors. In other words, we'll take the 50 actual game margins for games of Type A, subtract the expected margin from each of them (zero in this case) and then calculate the standard deviation of the resulting data set. We'll then do the same for the three other Types.

We'll generate the four samples of 50 games used in the calculations via the scoring model I described earlier. Repeating the process 10,000 times reveals that the true standard deviations are ordered, as we'd expect, from Type A to Type D (i.e. from games expected to be lower scoring to games expected to be higher scoring), with Type A games producing margin errors with a standard deviation of about 31.5 points, Type B of 32.5 points, Type C of 33.5 points, and Type D of 34.5 points.

One practical question then is: how often can we expect a single sample of 50 games of each Type to produce standard deviations in the correct order - that is, with the standard deviation of the margin errors increasing from Type A through to Type D?

The answer from the simulations is only about 9% of the time. (For comparison, the mean margins of the four Types will be correctly ordered, those same simulations show, about 77% of the time.) 

So variable are the estimates of the standard deviations of the margin errors, in fact, that in 26% of simulations the 50 Type A games have margin errors with a larger standard deviation than the 50 Type D games. Never though do the mean margins produce this same mis-ordering.
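As a cross-check, this ordering experiment can be approximated without running the full scoring model, by drawing each Type's margin errors from a Normal distribution with the true standard deviations quoted above. This is a simplifying assumption rather than a reproduction of the original simulations, but it should land in the same ballpark as the roughly 9% and 26% figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# True margin-error SDs for Types A to D, as per the simulations above
true_sds = np.array([31.5, 32.5, 33.5, 34.5])
n_games, reps = 50, 10_000

# For each of 10,000 repetitions, a sample SD of 50 margin errors per Type
errors = rng.normal(0, true_sds, size=(reps, n_games, 4))
sd_ests = errors.std(axis=1, ddof=1)

# How often are the four sample SDs in the correct (increasing) order,
# and how often does Type A's sample SD exceed Type D's?
correct_order = np.all(np.diff(sd_ests, axis=1) > 0, axis=1).mean()
a_above_d = (sd_ests[:, 0] > sd_ests[:, 3]).mean()
print(f"SDs correctly ordered: {correct_order:.1%}")
print(f"Type A SD above Type D SD: {a_above_d:.1%}")
```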

The highly variable nature of sample estimates of the standard deviation of game margins makes drawing inferences about them based on smallish samples very difficult.

Variability of margin error for a single game-type

To illustrate that highly variable nature I generated the chart below. It summarises the results for 10,000 samples of 50 games, with each game in each sample generated assuming that the Home team was expected to create 25 scoring shots and the Away team was expected to create the same number.

Based on the data in this chart, the 95% confidence interval for the true standard deviation runs from 27.4 points to 41.0 points. That's huge. Even a 50% confidence interval is quite wide and runs from 31.7 points to 36.4 points. (The median, by the way, is 34.1 points.)
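A quick normal-theory check is consistent with those simulated intervals: if margin errors are approximately Normal with true standard deviation sigma, the sample standard deviation from n games satisfies s = sigma × sqrt(chi-squared with n-1 degrees of freedom / (n-1)). Taking sigma to be the reported median of 34.1 points (an approximation of the true value):

```python
import numpy as np
from scipy.stats import chi2

# If margin errors are ~Normal(0, sigma), the sample SD from n games is
# distributed as sigma * sqrt(chi2_{n-1} / (n-1))
sigma, n = 34.1, 50   # sigma approximated by the reported median sample SD
lo = sigma * np.sqrt(chi2.ppf(0.025, n - 1) / (n - 1))
hi = sigma * np.sqrt(chi2.ppf(0.975, n - 1) / (n - 1))
print(f"95% interval for the sample SD: {lo:.1f} to {hi:.1f} points")
```

That analytic interval of roughly 27 to 41 points lines up well with the simulated one.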

Now the assumption that we might have a sample of 50 games of any particular type available is fairly generous, and the precision of any estimate produced from such a sample is, of course, better than we could achieve with smaller samples. With samples of just 20 games of each Type, for example, we obtain the correct ordering of margin error standard deviations just 8% of the time, and the standard deviation for Type A games exceeds that for Type D games 34% of the time.


In short, it's very difficult to empirically verify that game margin variability about its expectation is positively associated with the expected aggregate score - or with any other game feature, for that matter - based on small sample sizes.

So, I'm going to need a lot of data to explore this topic empirically, or a smarter analytic technique; donations and suggestions welcomed.