August 16, 2010

Letting the Computer Do (Most of) the Work

August 16, 2010/ Tony Corke

Around this time of year it's traditional to work through the remaining matches for each team and attempt to codify what each needs to do in order to secure a particular finish - minor premiership, top 4, top 8 or Spoon.

This year, rather than work through all the combinations manually, I've decided to be lazy - purely for instructional purposes, I should add - and enlist the help of rule induction, a mathematical technique for inferring from a dataset statements of the form "If A and B then C" that describe key variables in that data.

So, for example, if you were to apply the technique to help describe the use of heating and cooling appliances by a household over the course of a few years you might collect information several times each day about who was home, what the outside temperature was, what day of the week and time of day it was, and whether or not a heating or a cooling appliance was turned on.

Using a rule induction algorithm, you'd be able to come up with statements such as this one:

If Number of People Home is greater than 0 AND Outside Temperature is less than 15 degrees AND Time of Day is between 5:30pm and 11:30pm AND Day of Week is not Saturday or Sunday then Heating = ON (Probability 92%)

For this blog I provided a rule induction algorithm (the JRip Weka algorithm running in R, if you're curious) with the outputs from 10,000 of the simulations I used in my earlier blog, which included for each simulation:

The results of each of the remaining 16 games
The final ladder positions of each team if these were the actual results of each game

To simplify matters a little, and recognising that the main interest is not in exact ladder position finishes, I summarised each team's finishing position as either "1st","2nd to 4th","5th to 8th","9th to 15th", or "16th".
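As a rough sketch of that summarising step (the function name is mine, not something from the original analysis), each exact ladder position can be mapped to one of the five bands like this:

```python
def finishing_band(position: int) -> str:
    """Map an exact ladder position (1 to 16) to one of the five bands used here."""
    if not 1 <= position <= 16:
        raise ValueError("ladder position must be between 1 and 16")
    if position == 1:
        return "1st"
    if position <= 4:
        return "2nd to 4th"
    if position <= 8:
        return "5th to 8th"
    if position <= 15:
        return "9th to 15th"
    return "16th"


# Example: a team finishing 9th falls in the "9th to 15th" band.
print(finishing_band(9))
```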

The goal was that the rule induction algorithm would output rules of the form:

If X beats Y AND X beats Z AND ... then X finishes 5th to 8th

Rule induction worked remarkably well. Here are a few real examples of the rules that the algorithm offered up for Collingwood's fate:

Rule 1: (Collingwood..v..Adelaide <= 0) and (Hawthorn..v..Collingwood >= 0) and (Carlton..v..Geelong <= 0) => Collingwood = 2nd to 4th (168.0/2.0)
Rule 2: => Collingwood =1st (9832.0/0.0)

Rule 1 can be interpreted as follows:

If Collingwood loses to or draws with Adelaide (i.e. the margin in that game, couched in Collingwood's terms, is less than or equal to zero) AND Collingwood loses to or draws with Hawthorn AND Geelong beats or draws with Carlton, then Collingwood finish 2nd to 4th.

What's implicit here is that Geelong also beats West Coast but since, in the simulations, this always occurred when the other conditions in the rule were met, the algorithm didn't realise that this was an additional required condition.

As well, Collingwood can't be allowed to draw both its games otherwise Geelong can't overhaul them. Again, this situation didn't occur in the simulations I provided the algorithm, and not even the smartest algorithm can intuit instances that it's never seen.

I could probably have fixed both of these shortcomings by providing the algorithm with more than 10,000 simulations, though I'd pay a price in terms of computation time. Note though the (168.0 / 2.0) annotation at the end of this rule. That tells you that the rule could be applied to 168 of the simulations, but that it was wrong for 2 of them. Maybe the two simulations for which the rule applied but was incorrect included a Geelong loss to the Eagles or two draws for Collingwood.

Rule creation algorithms include what's called a "stopping rule" to prevent them from creating a unique rule for every simulation result, which might make the rules highly accurate but would also make them completely impractical.

Rule 2 is the "otherwise" rule and is interpreted as the predicted outcome if none of the earlier rules' full set of conditions are met. For Collingwood, "otherwise" is that they finish 1st.
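To make the mechanics concrete, here's a minimal sketch of how an ordered rule list with an "otherwise" rule can be applied, and how a coverage/error annotation like (168.0/2.0) can be counted. It is not the JRip implementation; the game keys, function names and the three toy simulations are all made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Margins are keyed by game and expressed from the first-named team's viewpoint,
# mirroring the "Collingwood..v..Adelaide <= 0" notation in the rules above.
Margins = Dict[str, int]
Rule = Tuple[Callable[[Margins], bool], str]

COLLINGWOOD_RULES: List[Rule] = [
    (
        lambda m: m["Collingwood v Adelaide"] <= 0
        and m["Hawthorn v Collingwood"] >= 0
        and m["Carlton v Geelong"] <= 0,
        "2nd to 4th",
    ),
    (lambda m: True, "1st"),  # the "otherwise" rule: no conditions at all
]


def first_match(rules: List[Rule], margins: Margins) -> int:
    """Return the index of the first rule whose conditions all hold."""
    for index, (condition, _) in enumerate(rules):
        if condition(margins):
            return index
    raise ValueError("rule list needs an unconditional final rule")


def predict(rules: List[Rule], margins: Margins) -> str:
    """Return the outcome attached to the first matching rule."""
    return rules[first_match(rules, margins)][1]


def coverage_and_errors(rules, rule_index, simulations):
    """Count the simulations a rule fires on, and how many of those it gets wrong."""
    covered = errors = 0
    for margins, actual in simulations:
        if first_match(rules, margins) == rule_index:
            covered += 1
            errors += predict(rules, margins) != actual
    return covered, errors


# Made-up toy simulations (margins, actual finishing band), purely for illustration.
toy_simulations = [
    ({"Collingwood v Adelaide": -12, "Hawthorn v Collingwood": 8, "Carlton v Geelong": -30}, "2nd to 4th"),
    ({"Collingwood v Adelaide": -3, "Hawthorn v Collingwood": 20, "Carlton v Geelong": -5}, "1st"),
    ({"Collingwood v Adelaide": 25, "Hawthorn v Collingwood": -14, "Carlton v Geelong": -40}, "1st"),
]
print(coverage_and_errors(COLLINGWOOD_RULES, 0, toy_simulations))  # (2, 1)
```

In the toy data Rule 1 fires on two simulations and is wrong for one of them, which is the same bookkeeping that sits behind the (168.0/2.0) annotation.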

The rules provided for other teams were generally quite similar, although they became more complex for teams when percentages were required to determine crucial ladder positions. Here, for example, are a few of the rules where the algorithm is attempting to model Hawthorn getting bumped into 9th by Melbourne:

(Hawthorn..v..Fremantle <= -7) and (Port.Adelaide..v..Melbourne = -14) and (Melbourne..v..Kangaroos >= 20) and (Hawthorn..v..Collingwood = -39) and (Melbourne..v..Kangaroos [...]) => Hawthorn = 9th to 15th (54.0/2.0)
(Hawthorn..v..Fremantle <= -4) and (Port.Adelaide..v..Melbourne = -7) and (Melbourne..v..Kangaroos = 11) and (Hawthorn..v..Collingwood = -59) and (Port.Adelaide..v..Melbourne = -32) => Hawthorn = 9th to 15th (41.0/3.0)

Granted that's a mite convoluted, but nothing that a human can't recognise fairly quickly, which nicely illustrates my experience with this type of algorithm: their outputs almost always contain some useful insights but the extraction of this insight requires human interpretation.

What follows, then, are the rules that man and machine have crafted for each team (note that I've chosen to ignore the possibility of draws to reduce complexity).

Collingwood

Finish 2nd to 4th if Collingwood lose to Adelaide and Hawthorn AND Geelong beat Carlton and West Coast
Otherwise finish 1st

Geelong

Finish 1st if Collingwood lose to Adelaide and Hawthorn AND Geelong beat Carlton and West Coast
Otherwise finish 2nd to 4th

St Kilda

Finish 2nd to 4th

Western Bulldogs

Finish 5th to 8th if Dogs lose to Essendon and to Sydney AND Fremantle beat Hawthorn and Carlton
Otherwise finish 2nd to 4th

Fremantle

Finish 2nd to 4th if Dogs lose to Essendon and to Sydney AND Fremantle beat Hawthorn and Carlton
Otherwise finish 5th to 8th

Carlton and Sydney

Finish 5th to 8th

Hawthorn

Finish 9th to 15th if Hawthorn lose to Fremantle and Collingwood AND Roos beat West Coast and Melbourne
Also Finish 9th to 15th if Hawthorn lose to Fremantle and Collingwood AND Melbourne beat Port and Roos sufficient to raise Melbourne's percentage above Hawthorn's
Otherwise finish 5th to 8th

Kangaroos

Finish 5th to 8th if Hawthorn lose to Fremantle and Collingwood AND Roos beat West Coast and Melbourne
Otherwise finish 9th to 15th

Melbourne

Finish 5th to 8th if Hawthorn lose to Fremantle and Collingwood AND Melbourne beat Port and Roos sufficient to raise Melbourne's percentage above Hawthorn's
Otherwise finish 9th to 15th

Adelaide, Port Adelaide and Essendon

Finish 9th to 15th

Brisbane Lions

Finish 16th if Lions lose to Essendon and Sydney AND West Coast beat Geelong and Roos sufficient to lift West Coast's percentage above the Lions' AND Richmond beat St Kilda or Port (or both)
Otherwise finish 9th to 15th

Richmond

Finish 16th if West Coast beat Geelong and Roos AND Richmond lose to St Kilda and Port
Otherwise finish 9th to 15th

West Coast

Finish 9th to 15th if West Coast beat Geelong and Roos AND Richmond lose to St Kilda and Port
Also Finish 9th to 15th if West Coast beat Geelong and Roos AND Lions lose to Essendon and Sydney sufficient to lift West Coast's percentage above the Lions'
Otherwise finish 16th

As a final comment I'll note that the rules don't allow for the possibility of Sydney or Carlton slipping into 4th. Although this is mathematically possible, it's so unlikely that it didn't occur in the simulations provided to the algorithm. (Actually, it didn't occur in any of the 100,000 simulations from which the 10,000 were chosen either.)

A quick bit of probability shows why.

Consider what's needed for Sydney to finish fourth.
1. The Dogs lose to Essendon and Sydney
2. Sydney also beat the Lions
3. Fremantle don't win both their games

Furthermore, combined, Sydney and the Dogs' results have to close the percentage gap between the two teams, which currently stands at over 25 percentage points.

But the 15% and 60% figures (the raw probabilities of results 1 and 2) just relate to the probability of the required result, not the probability that the wins and losses will be big enough to lift Sydney's percentage above the Dogs'. If Sydney were to trounce the Lions by 100 points and Essendon were to do likewise to the Dogs, then Sydney would still need to beat the Dogs by about 91 points to achieve such a lift.

So let's revise the probability of 1 down to 0.01% (which is probably generous) and the probability of 2 down to 5% (which is also generous). Then the overall probability is 0.01% x 5% x 80%, or about 1 in 250,000. Not gonna happen.
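For anyone who wants to check the arithmetic, here's a small sketch. It simply multiplies the three revised figures together, which treats the three events as independent, and it assumes the 80% belongs to condition 3; the variable names are mine.

```python
from fractions import Fraction

# Revised probabilities from the paragraph above, held as exact fractions.
p_dogs_lose_both_by_enough = Fraction(1, 10_000)  # 0.01% (condition 1)
p_sydney_beat_lions_by_enough = Fraction(5, 100)  # 5% (condition 2)
p_freo_dont_win_both = Fraction(80, 100)          # 80% (assumed to be condition 3)

# Multiplying the three treats the events as independent.
overall = (
    p_dogs_lose_both_by_enough
    * p_sydney_beat_lions_by_enough
    * p_freo_dont_win_both
)

print(f"about 1 in {1 / overall}")  # about 1 in 250000
```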

(For similar reasons there are also no rules for Fremantle dropping a game but still grabbing 4th from the Dogs on the basis of a superior percentage.)

August 11, 2010

Sometimes the Hare Wins

August 11, 2010/ Tony Corke

Leading early has never been as predictive of the final outcome as it has been this season.

Consider the statistics. In the 150 games that have produced a clear winner, that winner has led 75% of the time at the 1st change, 76% of the time at the main break, and a startling 89% of the time at the final change. Put another way, only 16 teams have trailed at the final change - by any amount - and gone on to win.

If we exclude slender leads, come-from-behind victories all but vanish. Only 8 teams with a lead of 2 goals or more at quarter time have surrendered that lead, and only 2 teams with a lead of 3 goals or more have done similarly from that point. A lead of 2 goals or more at the main break has been insufficient on only 9 occasions, and a lead of 3 goals or more on only 5 occasions.

No team - not one - has surrendered a three-quarter time lead of 3 goals or more this season, and only 6 teams have lost after leading at the final change by just 1 goal or more.

In an historical context, these statistics are all anomalous, as are the statistics relating to the quarters won by winning teams.

Usually, the teams that win have differentially asserted their dominance in the 3rd or the 4th quarter of games. Whilst winning the 1st or 2nd quarters has always been of some importance, failing to do so has, in years past, not been a significant impediment to victory. This year, however, winning teams have dominated 1st quarters most of all - teams that have taken the competition points have won 75% of 1st terms, but only 67% of 2nd terms, 69% of 3rd terms, and 72% of 4th terms.

Lead early, lead often.

August 10, 2010

Playing the Percentages

August 10, 2010/ Tony Corke

It seems very likely that this season, some ladder positions will be decided on percentage, so I thought it might be helpful to give you an heuristic for estimating the effect of a game result on a team's percentage.

A little maths produces the following exact result for the change in a team's percentage:

(1) New Percentage = Old Percentage + (%S - %C)/(1 + %C) * Old Percentage

where

%S = the points scored by the team in the game in question as a percentage of the points it has scored all season, excluding this game, and

%C = the points conceded by the team in the game in question as a percentage of the points it has conceded all season, excluding this game.

(In passing, I'll note that this equation makes it obvious that the only way for a team to increase its percentage on the basis of a single result is for %S to be greater than %C or, equivalently, for %S/%C to be greater than 1. Put another way, the team's percentage in the current game alone needs to exceed its pre-game percentage.

This equation also puts a practical cap on the extent to which a team's percentage can alter based on the result of any one game at this stage of the season. For a team with a high percentage the term (%S - %C) will rarely exceed 5%, so a team with, for example, an existing percentage of 140 will find it hard to move that percentage by more than about 7 percentage points. Alternatively, a team with an existing percentage of just 70, which might at the extremes produce a (%S - %C) of 7%, will find it hard to move its percentage by more than about 5 percentage points in any one game.)
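For the curious, here is one way the "little maths" behind equation (1) can be set out. The symbols S and C (season points scored and conceded before the game) and s and c (points scored and conceded in the game) are my own shorthand rather than notation from the text above.

```latex
\begin{align*}
\text{Old} &= \frac{S}{C}, \qquad \%S = \frac{s}{S}, \qquad \%C = \frac{c}{C} \\
\text{New} &= \frac{S + s}{C + c}
            = \frac{S\,(1 + \%S)}{C\,(1 + \%C)}
            = \text{Old} \cdot \frac{1 + \%S}{1 + \%C} \\
           &= \text{Old} \cdot \left(1 + \frac{\%S - \%C}{1 + \%C}\right)
            = \text{Old} + \frac{\%S - \%C}{1 + \%C} \cdot \text{Old}
\end{align*}
```

Nothing in the derivation depends on whether the percentage is quoted as 1.038 or 103.8, since Old appears as a factor on both sides.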

As an example of the use of equation (1), consider Sydney, who have scored 1,701 points this season and conceded 1,638, giving them a 103.8 percentage. If we assume, since this is Round 20, that they'll rack up a score this week that's about 5% of what they've previously scored all season and that they'll concede about 4%, then the formula tells us that their percentage will change by (5% - 4%)/(104%) * 103.8, or about 1 percentage point.

Now 5% x 1,701 is about 85, and 4% x 1,638 is about 66, so we've implicitly assumed an 85-66 victory by the Swans in the previous paragraph. Recalculating Sydney's percentage the long way we get (1,701+85)/(1,638+66), which gives a 104.8 percentage and is, indeed, a 1 percentage point increase.
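The same check can be scripted. This is just a sketch with function names of my own choosing; it compares equation (1) with the long-way calculation for the assumed 85-66 scoreline.

```python
def percentage(points_for: int, points_against: int) -> float:
    """A team's percentage: points scored per 100 points conceded."""
    return 100.0 * points_for / points_against


def new_percentage_eq1(old_pct: float, pct_s: float, pct_c: float) -> float:
    """Equation (1): Old + (%S - %C) / (1 + %C) * Old."""
    return old_pct + (pct_s - pct_c) / (1 + pct_c) * old_pct


# Sydney's season totals from the text and the assumed 85-66 win.
scored, conceded = 1701, 1638
game_for, game_against = 85, 66

old = percentage(scored, conceded)
via_formula = new_percentage_eq1(old, game_for / scored, game_against / conceded)
long_way = percentage(scored + game_for, conceded + game_against)

print(round(old, 1), round(via_formula, 1), round(long_way, 1))  # 103.8 104.8 104.8
```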

So we know that the formula works, which is nice, but not especially helpful.

To make equation (1) more helpful, we first need to note that at this stage of the season the points that a team concedes in a game are unlikely to be a large proportion of the points it has already conceded in the entire season. So the (1 + %C) in equation (1) is going to be very close to 1. That allows us to rewrite the equation as:

(2) Change in Percentage = (%S - %C) * Old Percentage

Now this equation makes it a little easier to play some what-if games.

For example we can ask what it would take for Sydney, who are currently equal with Carlton on competition points, to lift their percentage above Carlton's this weekend. Sydney's percentage stands now at 103.8 and Carlton's at 107.0, so Sydney needs a 3.2 percentage point lift.

Using a rearranged version of Equation (2) we know that achieving a lift of 3.2 percentage points from a current percentage of 103.8 requires that (%S - %C) be greater than 3.2/103.8, or about 3%. Now, if we assume that Sydney will concede points roughly equal to its season-long average then %C will be 1/19 or a bit over 5%.

So, to get the necessary lift in percentage, Sydney will need %S to be a bit over 5% + 3%, or 8%. To turn that into an actual score we take 8% x 1,701 (the number of points Sydney has scored in the season so far), which gives us a score of about 136. That's how many points Sydney will need to score to lift its percentage to around 107, assuming that its opponent this week (Fremantle) scores 5% x 1,638, which is approximately 82 points.

Within reasonable limits you can generalise this and say that Sydney needs to beat Fremantle by 54 points or more to lift its percentage to 107, regardless of the number of points Freo score. In reality, as Fremantle's score increases - and so %C rises - the margin of victory required by Sydney also rises, but only by a few points. A 60-point margin of victory will be enough to lift Sydney's percentage over Carlton's even in the unlikely event that the score in the Sydney v Freo game is as high as 170-110.
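If you'd like to play the same what-if game yourself, here's a hedged sketch. It puts the equation (2) heuristic next to a brute-force search on the exact percentage, taking Carlton's 107.0 as the target (that figure is rounded, so treat the exact margins as indicative). The function names are mine.

```python
SCORED, CONCEDED = 1701, 1638  # Sydney's season totals from the text
TARGET = 107.0                 # Carlton's percentage as quoted (rounded)


def heuristic_required_score(opponent_score: int) -> float:
    """Score needed according to equation (2): %S = %C + lift / Old Percentage."""
    old_pct = 100 * SCORED / CONCEDED
    pct_c = opponent_score / CONCEDED
    pct_s = pct_c + (TARGET - old_pct) / old_pct
    return pct_s * SCORED


def exact_min_margin(opponent_score: int) -> int:
    """Smallest winning margin whose exact new percentage exceeds TARGET."""
    margin = 1
    while 100 * (SCORED + opponent_score + margin) / (CONCEDED + opponent_score) <= TARGET:
        margin += 1
    return margin


print(round(heuristic_required_score(82)))      # heuristic score if Freo score 82
print(exact_min_margin(82), exact_min_margin(110))  # 58 60
```

The exact margins sit a few points above the heuristic's because equation (2) drops the (1 + %C) divisor, and they creep up as Fremantle's score rises, which is the behaviour described above.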

Okay, let's do one more what-if, this one a bit more complex.

What would it take for Melbourne to grab 8th spot this weekend? Well the Roos and Hawthorn would need to lose and the combined effect of Hawthorn's loss and Melbourne's win would need to drag Melbourne's percentage above Hawthorn's. Conveniently for us, Hawthorn and Melbourne meet this weekend. Even more conveniently, their respective points for and points against are all quite close: Hawthorn's scored 1,692 points and conceded 1,635; Melbourne's scored 1,599 and conceded 1,647.

The beauty of this fact is that, for both teams, in equation (2) Old Percentage is approximately 1 and, for any score, Hawthorn's %S will be approximately Melbourne's %C and vice versa. This means that any increase in percentage achieved by either team will be mirrored by an equivalent decrease in the percentage of the other.

All Melbourne needs do then to lift its percentage above Hawthorn's is to lift its percentage by one half the current difference. Melbourne's percentage stands at 97.1 and Hawthorn's at 103.5, so the difference is 6.4 and the target for Melbourne is an increase of 3.2 percentage points.

Melbourne then needs (%S-%C) to be a bit bigger than 3%. Since the divisors for both %S and %C are about the same we can re-express this by saying that Melbourne's margin of victory needs to be around 3% of the points it's conceded so far this season, which is 3% of 1,647 or around 50 points. Let's add on a few points to account for the fact that we need the margin to be a little over 3% and call the required margin 53 points.

So how good is our approximation? Well if Melbourne wins 123-70, Hawthorn's new percentage would be (1,692+70)/(1,635+123) = 1.002, and Melbourne's would be (1,599+123)/(1,647+70) = 1.003. Score 1 for the approximation. If, instead, it were a high-scoring game and Melbourne won 163-110, then Hawthorn's new percentage would be (1,692+110)/(1,635+163) = 1.002, and Melbourne's would be (1,599+163)/(1,647+110) = 1.003. So that works too.
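Here's a small sketch that runs the same check for any Hawthorn score by searching for the smallest Melbourne margin that puts the Dees' percentage above the Hawks'. The names are my own, and the comparison is done by cross-multiplying integers so that rounding can't muddy the result.

```python
HAWTHORN = (1692, 1635)   # (points scored, points conceded) from the text
MELBOURNE = (1599, 1647)


def melbourne_overtakes(melbourne_score: int, hawthorn_score: int) -> bool:
    """True if Melbourne's post-game percentage exceeds Hawthorn's."""
    melb_for = MELBOURNE[0] + melbourne_score
    melb_against = MELBOURNE[1] + hawthorn_score
    hawk_for = HAWTHORN[0] + hawthorn_score
    hawk_against = HAWTHORN[1] + melbourne_score
    # melb_for / melb_against > hawk_for / hawk_against, cross-multiplied.
    return melb_for * hawk_against > hawk_for * melb_against


def min_melbourne_margin(hawthorn_score: int) -> int:
    """Smallest winning margin that lifts Melbourne's percentage above Hawthorn's."""
    margin = 1
    while not melbourne_overtakes(hawthorn_score + margin, hawthorn_score):
        margin += 1
    return margin


print(min_melbourne_margin(70), min_melbourne_margin(110))  # 53 53
```

For both the 70-point and 110-point Hawthorn scores the search lands on 53 points, matching the margin estimated above.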

In summary, a victory by the Dees over the Hawks by around 9 goals or more would, assuming the Roos lose to West Coast, propel Melbourne into the eight - not a confluence of events I'd be willing to wager large sums on, but a mathematical possibility nonetheless.

July 30, 2010

Line Betting : A Codicil

July 30, 2010/ Tony Corke

While contemplating the result from an earlier blog, which was that home teams had higher handicap-adjusted margins and won at a rate significantly higher than 50% on line betting - virtually regardless of the start they were giving or receiving - I wondered if the source of this anomaly might be that the bookie gives home teams a slightly better deal in setting line margins.

July 29, 2010

A Line Betting Enigma

July 29, 2010/ Tony Corke

The TAB Sportsbet bookmaker is, as you know, a man to be revered and feared in equal measure. Historically, his head-to-head prices have been so exquisitely well-calibrated that I instinctively compare any model I construct with the forecasts he produces. To show that a model historically outperforms leads me to scuttle off to determine what error I've made in constructing the model, what piece of information I've used that, in truth, was only available with the benefit of hindsight.

July 17, 2010

The Importance of a Team's Recent Form: What Bookies (and MARS) Think

July 17, 2010/ Tony Corke

When the TAB Sportsbet bookie is framing a market for an upcoming game, clearly one set of data that he uses is the recent results of the participating teams.

July 17, 2010

Super Smart is Taking Heed of Bookies

July 17, 2010/ Tony Corke

Across a series of blogs now we've explored the Super Smart Model (SSM) and investigated its ability to predict victory margins. In this blog we'll look more closely at which variables most influence SSM's forecasts.

July 15, 2010

Trialling The Super Smart Model

July 15, 2010/ Tony Corke

The best way to trial a potential Fund algorithm, I'm beginning to appreciate, is to publish each week the forecasts that it makes. This forces me to work through the mechanics of how it would be used in practice and, importantly, to set down what restrictions should be applied to its wagering - for example should it, like most of the current Funds, only bet on Home Teams, and in which round of the season should it start wagering.

July 13, 2010

Simplifying MARS Rating Updates: An Epilogue

July 13, 2010/ Tony Corke

In my previous blog I used Eureqa to find a simpler version of the equations for updating MARS Ratings. There I jumped straight to what I deemed the 'best' solution that Eureqa had found, glossing over a slew of perfectly adequate and much simpler solutions that it also found.

July 13, 2010

MARS Ratings Revisited: There Must Be a Simpler Way

July 13, 2010/ Tony Corke

It's official: Eureqa is an amazing tool. With all the recent model-building I've been undertaking and writing up here in various blogs, I've become more aware of the predictive power of MARS Ratings.

July 12, 2010

Predicting Head-to-Head Market Prices

July 12, 2010/ Tony Corke

In earlier blogs I've claimed that there's not much additional information in bookie prices that's useful for predicting victory margins than what can be derived from a statistical analysis of recent results and an understanding of game venues.

July 09, 2010

The Relationship Between Head-to-Head Price and Points Start

July 09, 2010/ Tony Corke

I've found yet another MAFL-related use for the Eureqa tool, this time to determine the precise relationship between a team's head-to-head price and the start it's giving or receiving on line betting. A simple plot of the history of a team's head-to-head price (or the probability that can be inferred from it) versus its start on line betting makes it obvious that there's a relationship between the two and that it's a non-linear one, but in the past I've been constrained by my own (lack of) ingenuity and persistence in generating sufficient possibilities to find its exact nature.

July 08, 2010

Building The Super Smart Model

July 08, 2010/ Tony Corke

In a previous blog we investigated the information content of bookie prices in the context of predicting the victory margin of a game.

July 07, 2010

What Do Bookies Know That We Don't?

July 07, 2010/ Tony Corke

Bookies, I think MAFL has comprehensively shown, know a lot about football, but just how much more do they know than what you or I might glean from a careful review of each team's recent results and some other fairly basic knowledge about the venues at which games are played?

June 08, 2010

In-Running Wagering: What's the Best Strategy?

June 08, 2010/ Tony Corke

With services such as Betfair now offering in-running wagering opportunities, the ability to accurately assess a team's chances of victory at any given point in a game is now of considerable commercial value. Imagine, for example, that your team, who are at home, lead by 18 points at the first change. Would a wager on them at $1.40 be advised?

June 04, 2010

In-Running Prediction of the Winner of an AFL Game

June 04, 2010/ Tony Corke

I've been planning to create this model for a while. With it, you can calculate the probability that the home team will eventually prevail given the state of the match at a particular point in the game.

June 01, 2010

Looking At Team Performance Quarter-By-Quarter

June 01, 2010/ Tony Corke

AFL Football - as the cliche goes - is a game of four quarters. The benefit of this arrangement is that AFL scores provide twice as much information about the ebb and flow of each contest as the scores of any other form of football in this country. With the quarter-by-quarter information alone we can perform some interesting analyses for every team.

May 20, 2010

Should We Have Been Surprised About the Season So Far?

May 20, 2010/ Tony Corke

Surprisals, you might recall, are a way of measuring the likelihood of the result of a chance outcome. They're measured in bits, and one bit of surprisal is the amount of surprise that you should feel in correctly predicting the toss of one unbiased coin.

May 17, 2010

Is the Home Ground Advantage Disappearing?

May 17, 2010/ Tony Corke

Home teams, as a whole, have not fared particularly well this season, which is one of the reasons that most of the MAFL Funds - all but one of which bet exclusively on home teams - have been testing Investors' patience.

April 26, 2010

Modelling AFL Team Scoring : Part III

April 26, 2010/ Tony Corke

This is the third in a series of blogs (here are Part I and Part II) about modelling the scoring of AFL teams and, with the heavy statistical lifting out of the way, in this blog we can look at the practical uses of what we've discovered so far.

Statistical Analyses