At the moment I'm preparing a presentation for a couple of analyst and R user groups based heavily on an earlier blog from here on MatterOfStats describing the ten most surprising things I've learned while analysing AFL data. I'll post that presentation on MatterOfStats once I've delivered it.
In the meantime, as part of the preparation I've been refamiliarising myself with the idea of Pythagorean Expectation and its application to the historical home-and-away results for the VFL/AFL. Put simply, the Pythagorean Expectation approach assumes that teams' winning percentages can be related to their points-scoring performance over a series of games by an equation of the form:

$$\text{Winning Percentage} = \frac{(\text{Points For})^k}{(\text{Points For})^k + (\text{Points Against})^k}$$
When Bill James first applied this approach to the winning percentages of Major League Baseball (MLB) teams (using "Runs" instead of "Points" in the equation) he found that setting k=2 gave him an acceptable fit. The resulting equation reminded him of Pythagoras' Theorem and so the term "Pythagorean Expectation" was coined.
We can simplify the equation a little as shown at right, which makes obvious the connection between the equation and the AFL's Percentage metric. The term in brackets in the denominator is the inverse of a team's Percentage (divided by 100, to be precise).
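For anyone reading without the image to hand, the simplification can be sketched as follows, assuming the standard form of the equation and writing PF and PA (my shorthand here, not notation from the original figure) for Points For and Points Against. Dividing the numerator and denominator by PF^k gives:

$$
\text{Winning Percentage} = \frac{PF^k}{PF^k + PA^k} = \frac{1}{1 + \left(\dfrac{PA}{PF}\right)^k}
$$

Since the AFL's Percentage is 100 × PF / PA, the bracketed term PA / PF is equal to 100 / Percentage.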
Subsequent analysis has suggested that a value of k equal to 1.83 provides a superior fit for MLB teams. Analyses by others, which I found scattered over the web, have found the following values to be "optimal" for some other major sports:
- EPL : 1.3
- NHL : 2.15
- NFL : 2.37
- NBA : 13.91
Yet more sports to which the equation might usefully be applied, but for which I've been unable to find the requisite analyses, include Rugby League, Rugby Union, Squash, Tennis, Badminton, and Table Tennis. In short, the method can be applied to any sport where the scoring is such that one team "scores" and the other "concedes".
APPLICATION TO THE VFL/AFL
Using the nls function in R to fit the equation to the entire history of the VFL/AFL yields the result shown at right. What's astonishing about this equation is that it explains over 88% of the variability in teams' winning percentages despite applying the same exponent to 1,406 different team-seasons over 117 years of football history.
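The fit itself was done with R's nls, which chooses k to minimise the sum of squared differences between actual and fitted winning percentages. As a rough illustration of the same least-squares idea, here is a minimal Python sketch that swaps in a plain golden-section search for nls. The function names, the search bounds and the record layout (one tuple of points for, points against and actual winning fraction per team per season) are all my assumptions for the example, not the code used for this post.

```python
import math


def pythagorean_win_pct(points_for, points_against, k):
    """Expected winning fraction under the Pythagorean Expectation formula."""
    return 1.0 / (1.0 + (points_against / points_for) ** k)


def sum_squared_error(k, records):
    """Sum of squared gaps between actual and fitted winning fractions.

    records: list of (points_for, points_against, actual_win_fraction).
    """
    return sum(
        (actual - pythagorean_win_pct(pf, pa, k)) ** 2
        for pf, pa, actual in records
    )


def fit_k(records, lo=0.1, hi=20.0, tol=1e-6):
    """Golden-section search for the k that minimises the squared error.

    Assumes the error has a single minimum between lo and hi
    (hypothetical bounds chosen to cover the k values quoted in the post).
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sum_squared_error(c, records) < sum_squared_error(d, records):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


def r_squared(k, records):
    """Share of the variance in actual winning fractions explained at k."""
    actuals = [actual for _, _, actual in records]
    mean = sum(actuals) / len(actuals)
    ss_total = sum((a - mean) ** 2 for a in actuals)
    return 1.0 - sum_squared_error(k, records) / ss_total
```

Calling fit_k on the full list of team-season records and then r_squared at the fitted k would give the two quantities discussed above, the exponent and the proportion of variability explained.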
One way of interpreting this equation is that it provides a means for converting a team's expected points-scoring superiority into a probability of victory. For example, a team that is expected to score 5% more points than its opponent has a victory probability of 54.7% as shown by the derivation at left.
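The derivation isn't reproduced in this text, but the arithmetic can be sketched by working backwards from the quoted 54.7%. This is my reconstruction under the assumption that the fitted equation has the standard form, so the k it produces is an implied value rather than the fitted one read off the chart:

$$
\frac{1}{1 + \left(\dfrac{1}{1.05}\right)^{k}} = 0.547
\;\Longrightarrow\;
1.05^{k} = \frac{0.547}{1 - 0.547} \approx 1.2075
\;\Longrightarrow\;
k = \frac{\ln 1.2075}{\ln 1.05} \approx 3.86
$$

Because 54.7% is itself rounded, this only pins k down to somewhere between roughly 3.8 and 3.9.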
Interpreted in this way, the value of k in the equation becomes a measure of how much a sport "rewards" any given superiority in expected points scoring (expressed as the ratio of expected points scored to points conceded) with greater certainty of victory. Sports with smaller values of k reward such superiority less and, in this sense, might be seen as being more prone to "lucky" outcomes. The graph on the right shows, for the sports listed earlier and their respective k values, how they each reward points-scoring superiority.
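For readers who want the numbers behind a chart of this kind, the sketch below tabulates the victory probability implied by each sport's k at a few levels of scoring superiority. The k values are the ones quoted earlier in this post; the function names and the particular ratios (5%, 10% and 20% superiority) are illustrative choices of mine.

```python
# k values as quoted earlier in this post for each sport
SPORT_K = {"EPL": 1.3, "MLB": 1.83, "NHL": 2.15, "NFL": 2.37, "NBA": 13.91}


def win_probability(scoring_ratio, k):
    """Victory probability for a team expected to score `scoring_ratio`
    times as many points as its opponent (e.g. 1.10 for 10% more)."""
    return 1.0 / (1.0 + scoring_ratio ** (-k))


def reward_table(ratios=(1.05, 1.10, 1.20), sport_k=SPORT_K):
    """Map each sport to its victory probabilities at the given ratios."""
    return {
        sport: [win_probability(ratio, k) for ratio in ratios]
        for sport, k in sport_k.items()
    }


if __name__ == "__main__":
    # One row per sport, one column per scoring ratio
    for sport, probs in reward_table().items():
        print(sport, " ".join(f"{p:.1%}" for p in probs))
```

For any fixed ratio above 1, a larger k gives a larger victory probability, which is the sense in which high-k sports "reward" superiority more.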
Looking at the sports shown here and their optimal k values, one possibility suggested to me by this chart is that the optimal k for a sport is at least partly linked to the number of points-scoring opportunities typically realised during a game of that sport and the number of points typically associated with each of those opportunities. In the EPL, for example, aggregate scores are typically low - say 1 or 2 goals - and each score is worth a single "point". So, even if a team is expected to score 10% more goals than its opponent, a single fortuitous event can reverse the outcome of the game and see the inferior team emerge victorious.
Compare that with the situation for the AFL, where a typical game yields about 50 scoring opportunities, each worth about 3.7 points. A team expected to score 10% more than its opponent - winning, say, 97 to 88 - would lose only if the weaker team secured about 3 more scoring shots than expected.
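The arithmetic behind that AFL example can be checked with a few lines of Python. This is only a sketch under simplifying assumptions of my own: the stronger team scores exactly its expected total, and every extra scoring shot for the weaker team is worth the average 3.7 points. The function name is hypothetical.

```python
import math


def shots_to_overturn(total_shots, points_per_shot, scoring_ratio):
    """Return (stronger_score, weaker_score, extra_shots_needed).

    Splits the expected total points between the teams in the given
    ratio, then counts the whole extra scoring shots the weaker team
    needs to finish ahead.
    """
    total_points = total_shots * points_per_shot
    stronger = total_points * scoring_ratio / (1 + scoring_ratio)
    weaker = total_points / (1 + scoring_ratio)
    margin = stronger - weaker
    # Smallest whole number of shots worth strictly more than the margin
    extra_shots = math.floor(margin / points_per_shot) + 1
    return stronger, weaker, extra_shots


stronger, weaker, extra = shots_to_overturn(50, 3.7, 1.10)
print(round(stronger), round(weaker), extra)
```

With 50 shots at 3.7 points each, the expected 185 points split roughly 97 to 88, and the gap of about 9 points takes 3 average scoring shots to overturn.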
APPLICATION TO THE NRL
As one way to test my hypothesis about the relationship between k and the number and size of points-scoring opportunities, I decided to apply the Pythagorean Expectation approach to the 106 years of NRL history (using, for anyone curious to know, the NRL and not the Super League data for 1997).
For visitors to MatterOfStats who are familiar only with the AFL portion of the Australian winter sporting calendar, Rugby League is also played here at roughly the same time as Australian Rules, and the NRL is the national Rugby League competition. It started later than the VFL, in 1908, but has also been contested every year without pause since it commenced.
The result of fitting the Pythagorean Expectation equation to NRL history, shown at right, is about what I expected. An exponent of 1.89 makes the NRL most similar, amongst the sports whose k values I've listed, to the MLB. I'd venture that the number of scoring opportunities in both is broadly similar too - a blog for another day perhaps.
It's interesting to note how well the equation can also be made to fit NRL home-and-away season results. The R-squared of 87.2% is just 1 percentage point less than the fit for the VFL/AFL data.
Evidently, the Pythagorean Expectation approach is applicable to a wide and diverse set of sports. For the more mathematically and statistically minded reader, this paper by Steven J Miller provides a rationale for its applicability under the assumption that the teams' scores in a contest are distributed as independent Weibull random variables.
For the VFL/AFL and the NRL analyses so far I've been fitting a single equation to the entire expanse of their respective histories. But the Pythagorean Expectation approach can be applied to any subset of games - indeed one of its original uses was to project the final winning percentage of MLB teams mid-season - so I've also fitted separate equations to each season individually for both the VFL/AFL and the NRL team data.
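A season-by-season version of the fit might look something like the sketch below. It is not the R code used for this post: it replaces nls with a simple grid search over candidate k values, and the record layout, function names and grid range are assumptions made for illustration. Alongside each season's k it also returns the correlation between fitted and actual winning percentages as a measure of fit.

```python
import math
from collections import defaultdict


def fitted_pct(points_for, points_against, k):
    """Pythagorean Expectation winning fraction."""
    return 1.0 / (1.0 + (points_against / points_for) ** k)


def best_k(rows, k_grid):
    """Grid search: the k in k_grid with the smallest squared error."""
    def sse(k):
        return sum((actual - fitted_pct(pf, pa, k)) ** 2
                   for pf, pa, actual in rows)
    return min(k_grid, key=sse)


def pearson(xs, ys):
    """Pearson correlation (undefined if either list is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def fit_by_season(records, k_grid=None):
    """records: iterable of (season, points_for, points_against, win_fraction).

    Returns {season: (best k, correlation of fitted vs actual)}.
    """
    if k_grid is None:
        # Hypothetical search range: 0.50 to 10.00 in steps of 0.01
        k_grid = [i / 100 for i in range(50, 1001)]
    by_season = defaultdict(list)
    for season, pf, pa, actual in records:
        by_season[season].append((pf, pa, actual))
    results = {}
    for season, rows in sorted(by_season.items()):
        k = best_k(rows, k_grid)
        fitted = [fitted_pct(pf, pa, k) for pf, pa, _ in rows]
        actuals = [actual for _, _, actual in rows]
        results[season] = (k, pearson(fitted, actuals))
    return results
```

A grid search is slower than a proper optimiser but has no convergence issues, which matters when a season has only a handful of teams to fit.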
Here, firstly, are the results for the VFL/AFL.
The chart at the top tracks the value of k for each season, with the blue line a loess fit that reveals the underlying trend in k. It shows that, for much of the period since about 1930, the optimal value of k in each season has tracked either side of 4, rarely moving outside the 3 to 5 range and trending very slightly downwards over that period. Prior to 1930, optimal k values tended to be much lower, driven at least in part, according to my hypothesis, by the lower levels of scoring opportunities that prevailed at the time.
(Actually there is some empirical support for my hypothesis from this season-by-season analysis of the VFL/AFL. An Ordinary Least Squares regression of the 117 seasons with Average Points per Scoring Shot and Number of Scoring Shots per Game as regressors and with the optimal k value as the target variable reveals that the two regressors together explain over 40% of the variability in optimal k. Further, the larger the number of Scoring Shots per Game and the smaller the average Score per Scoring Shot, the larger the value of k.)
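For anyone wanting to try a regression of this kind themselves, here is a minimal sketch of Ordinary Least Squares with two regressors, written in plain Python. The function and argument names are hypothetical, and the season data isn't reproduced in this post, so none is shown; x1 and x2 would be each season's Average Points per Scoring Shot and Number of Scoring Shots per Game, and y the season's optimal k.

```python
def ols_two_regressors(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 by Ordinary Least Squares.

    Returns (b0, b1, b2, r_squared).
    """
    n = len(y)
    mean_1, mean_2, mean_y = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centre every variable so the slopes come from a 2x2 system
    d1 = [v - mean_1 for v in x1]
    d2 = [v - mean_2 for v in x2]
    dy = [v - mean_y for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("regressors are perfectly collinear")
    # Solve the normal equations by Cramer's rule
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean_y - b1 * mean_1 - b2 * mean_2
    ss_res = sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(d1, d2, dy))
    ss_tot = sum(c * c for c in dy)
    return b0, b1, b2, 1.0 - ss_res / ss_tot
```

In terms of the finding described above, the coefficient on Scoring Shots per Game would come out positive, the coefficient on Average Points per Scoring Shot negative, and the returned R-squared a little over 0.4.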
The lower chart tracks the correlation between teams' fitted and actual winning percentages. For most seasons this correlation is 90% or higher, indicating a very high level of agreement.
Below are the same charts this time fitted to NRL history on a season-by-season basis.
Here we find that the optimal value of k oscillates around 2 from about the late 1930s onwards, never breaching the range from 1.5 to 3. Prior to that it varied mostly in the 1.5 to 2 range.
As with the VFL/AFL, the fit in most seasons is 90% or above, although it's been in mild decline over about the last 30 seasons and has been very low on more occasions than in the VFL/AFL analysis. The 1933 season was a particularly difficult one to fit, partly because, in that year, St George won 57% of their games but scored 9 points fewer than they conceded, while University won only 39% of their games but scored 2 points more than they conceded. In addition, only eight teams participated in that season.
The Pythagorean Expectation approach has been used by others to model the winning percentages of teams in a wide variety of sports. We've seen in this blog that the methodology provides an excellent fit to Australian VFL/AFL and NRL results too.
Optimal values of k are generally larger for VFL/AFL seasons than for NRL seasons, which implies that a VFL/AFL team with a given level of points-scoring superiority over its opponents (measured by the expected ratio of its score to its opponent's) enjoys a larger victory probability than an NRL team with the same level of superiority.