Predictors Behaving Badly: Intransitivity Revisited

If I told you that I prefer green to blue, and blue to red, you'd probably assert that, logically, I should prefer green to red. In technical terms you'd be demanding that my colour preferences exhibit the property known as transitivity - one of the axioms of expected utility theory.

Even though this might seem like a reasonable assumption, empirical studies with real humans in real situations demonstrate that this is not always the case. People's preferences can be resolutely intransitive - a state of affairs that some suggest implies an irrationality on the part of individuals and that others find completely explicable.

Competitive Intransitivity

I'm not especially surprised or concerned about the fact that someone's - even my own - preferences might display intransitivity, but I was startled when I first came across intransitivity in a purely competitive context when I read about non-transitive dice. Amongst the set of dice described in that Wikipedia article, none can be described as the "best" in the sense that, head-to-head, it beats the other two. If I allow you to choose any one of the dice to roll, I can choose another which is more likely than not to produce a roll higher than yours, regardless of which one you choose. So, amongst these dice the property of "likely to roll the higher number when pitted against another die" is intransitive.
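
To make the dice example concrete, here's a minimal sketch in Python that computes exact head-to-head win probabilities for one well-known set of three non-transitive dice. The face values and the labels A, B and C are my assumption for illustration; they may not be the exact set the article describes.

```python
from fractions import Fraction
from itertools import product

# Assumed example: a commonly cited set of three non-transitive dice.
dice = {
    "A": (2, 2, 4, 4, 9, 9),
    "B": (1, 1, 6, 6, 8, 8),
    "C": (3, 3, 5, 5, 7, 7),
}

def win_probability(first, second):
    """Exact probability that a roll of `first` is strictly higher than a roll of `second`."""
    wins = sum(x > y for x, y in product(first, second))
    return Fraction(wins, len(first) * len(second))

# Each die in the cycle A -> B -> C -> A is favoured over the next one.
for name_x, name_y in [("A", "B"), ("B", "C"), ("C", "A")]:
    p = win_probability(dice[name_x], dice[name_y])
    print(f"P({name_x} rolls higher than {name_y}) = {p}")
```

For this particular set, every die beats its successor in the cycle with probability 5/9, so whichever die you choose there's another that's favoured against it.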

(The Wikipedia article on the voting paradox identifies a similar issue.)

What's the link to MAFL? Well, in the last blog, I was looking for a model that would, in more than half the games, predict nearer the Home team or the Away team score than another model. That got me wondering which of the Margin Predictors from last season was best at getting nearest the actual final margin. In other words, if the final result of a game was Geelong by 6 points, which Margin Predictor's margin prediction was nearest to this outcome?

It turns out that Bookie_3 was the best Margin Predictor of season 2011 on this metric. It was closest to the actual final margin amongst all Margin Predictors in almost 18% of games. Next best was Combo_NN_2, which was closest in just under 17% of contests, then Combo_NN_1, which was closest in just under 14%. Much further down the list was Combo_7, which was nearest the actual margin in less than 5% of games.

But here's the weird bit: considered head-to-head, Combo_7 was nearer the final margin than Combo_NN_2 in almost 56% of games (odd enough in itself given the numbers in the previous paragraph), Combo_NN_2 was nearer the final margin than Combo_NN_1 in 51% of games, and Combo_NN_1 was nearer the final margin than Combo_7 in 50.5% of games. In short, Combo_7 beats Combo_NN_2 head-to-head, which beats Combo_NN_1 head-to-head, which in turn beats Combo_7 head-to-head (which I'll write as C7 > CN2 > CN1 > C7). There's your intransitivity right there.
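
As a rough sketch of how a head-to-head figure like these can be calculated, here's some Python. The three Predictors, their predictions and the actual margins are all made up for illustration (they're not MAFL data), and I've assumed that a game where two Predictors are equally close counts as a win for neither.

```python
# Hypothetical data: actual margins for three games and three made-up Predictors.
actual = [6, -12, 30]
predictions = {
    "A": [7, -15, 32],  # absolute errors 1, 3, 2
    "B": [8, -13, 33],  # absolute errors 2, 1, 3
    "C": [9, -14, 31],  # absolute errors 3, 2, 1
}

def head_to_head(pred_x, pred_y, actual):
    """Share of games in which pred_x is strictly nearer the actual margin than pred_y."""
    wins = sum(abs(x - a) < abs(y - a) for x, y, a in zip(pred_x, pred_y, actual))
    return wins / len(actual)

# A is nearer than B, B is nearer than C, and C is nearer than A, each in 2 games out of 3.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    share = head_to_head(predictions[x], predictions[y], actual)
    print(f"{x} nearer than {y} in {share:.1%} of games")
```

Even with just three games the made-up Predictors form a cycle, because each one's best prediction falls in a different game.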

Inspection of the full set of pairwise results, tabulated below, uncovers many other such intransitivities.

(The coloured entries show how often the Margin Predictor named in the left-hand column predicted nearer the final margin than did the Margin Predictor named in the column heading. So, for example, Bookie_3 predicted nearer the final margin than Win_7 in 60.2% of games last season.)

In this table it's relatively easy to find short-chain intransitivities similar to the one I described earlier. For example we have: B9 > CN2 > HU10 > B9, W3 > HU3 > PP3 > W3, and HA3 > HA7 > W7 > HA3.
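
Short chains like these can be found mechanically. The Python sketch below searches a set of "X beats Y head-to-head" pairs for three-way cycles. I've entered only the twelve pairs implied by the four chains quoted in this post, not the full table, and the function name three_cycles is just my own label.

```python
from itertools import combinations

# (winner, loser) pairs taken only from the chains quoted in the text.
beats = {
    ("C7", "CN2"), ("CN2", "CN1"), ("CN1", "C7"),
    ("B9", "CN2"), ("CN2", "HU10"), ("HU10", "B9"),
    ("W3", "HU3"), ("HU3", "PP3"), ("PP3", "W3"),
    ("HA3", "HA7"), ("HA7", "W7"), ("W7", "HA3"),
}

def three_cycles(beats):
    """Return every trio (a, b, c) for which a beats b, b beats c and c beats a."""
    names = sorted({name for pair in beats for name in pair})
    cycles = []
    for a, b, c in combinations(names, 3):
        if (a, b) in beats and (b, c) in beats and (c, a) in beats:
            cycles.append((a, b, c))
        elif (a, c) in beats and (c, b) in beats and (b, a) in beats:
            cycles.append((a, c, b))
    return cycles

found = three_cycles(beats)
for a, b, c in found:
    print(f"{a} > {b} > {c} > {a}")
```

Fed the full pairwise table instead of this subset, the same search would list every three-Predictor intransitivity in it.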

Perhaps most impressive is the following intransitive relationship, which takes in every Margin Predictor except Bookie_3 (which beats every other Predictor head-to-head) and ProPred_7 (which loses to every other Predictor head-to-head): C7 < CN1 < CN2 < B9 < HU10 < W7 < PP3 < HU3 < W3 < HA3 < C7.

With knowledge of this relationship I could let you choose any of the Margin Predictors (except Bookie_3) and then choose another Predictor with a superior head-to-head record. If you chose W7, for example, I'd choose the Margin Predictor directly to its right, PP3, and thereby select a Margin Predictor with a 51% record over yours.
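
The counter-picking strategy amounts to a lookup in that chain. A minimal sketch in Python, using the abbreviations from the chain above (the names CHAIN and counter_pick are mine):

```python
# The cycle from the text, written left to right so that each Predictor
# loses head-to-head to the one on its right; the last wraps round to the first.
CHAIN = ["C7", "CN1", "CN2", "B9", "HU10", "W7", "PP3", "HU3", "W3", "HA3"]

def counter_pick(your_choice, chain=CHAIN):
    """Return the Predictor immediately to the right of your_choice in the cycle."""
    position = chain.index(your_choice)
    return chain[(position + 1) % len(chain)]

print(counter_pick("W7"))   # PP3
print(counter_pick("HA3"))  # C7, wrapping round to the start of the chain
```

Bookie_3 isn't in the list, so asking for a counter-pick to it raises a ValueError, which seems appropriate given that nothing beats it head-to-head.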

So, Bookie_3 aside, there's no best Margin Predictor, at least in the sense that the selected Margin Predictor should have a superior head-to-head record to every other Predictor in terms of predicting margins closer to the actual margin.

Part of the reason for this curious behaviour is, I think, the unusual nature of the metric "closer". Being better than Predictor B on this metric requires only that a Predictor A's accurate predictions be well-timed in relation to B's - and that's all it requires.

Okay, so the pairwise head-to-head results display intransitivity and are, in that sense, hard to explain. We can't expect to find a single metric for each Predictor that will correlate highly with its head-to-head performance because a single metric implies a unique ordering yet we know that head-to-head results are intransitive.

What about the Predictors' overall results - the proportion of times that they're the closest of all Predictors? Can I find a metric to explain this performance? Without one, it'll be difficult to deliberately construct models that are strong performers in this type of contest.

My first instinct was that performance might be related to a Predictor's mean or median absolute prediction error (APE), but, as the two rightmost pairs of columns in the table above show, this intuition turns out to be wrong. For example, some of the Predictors that are most often closest amongst all Predictors to the final margin - Combo_NN_1, Bookie_9 and ProPred_3 - have some of the worst mean APEs of all, and two of them also have poorly-ranking median APEs.

These metrics fail because, with "nearest" as the goal, being 1 point more distant from the actual margin than all other Predictors is no different from being 100 points more distant. Being nearest amongst all Predictors requires only that a Predictor produce the occasional extremely accurate prediction - the accuracy of its other predictions is completely irrelevant. Consequently, it's possible to be closest relatively often but, on average or more than half the time, to be quite far away.
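
Here's a small Python sketch of that effect. The four Predictors and five games are invented purely for illustration: "Erratic" is either spot on or wildly wrong, while the three "Steady" Predictors are always within a goal or so.

```python
from statistics import mean, median

# Hypothetical actual margins and predictions (not MAFL data).
actual = [10, 20, -5, 30, 0]
predictions = {
    "Erratic":  [10, 20, 35, -20, 30],
    "Steady_1": [14, 25, -8, 36, 6],
    "Steady_2": [5, 16, 1, 27, -7],
    "Steady_3": [16, 14, -10, 35, 2],
}

# Absolute prediction error (APE) for every Predictor in every game.
abs_errors = {
    name: [abs(p - a) for p, a in zip(preds, actual)]
    for name, preds in predictions.items()
}

def nearest_share(abs_errors):
    """Share of games in which each Predictor has the smallest APE (first listed wins any tie)."""
    names = list(abs_errors)
    n_games = len(next(iter(abs_errors.values())))
    counts = dict.fromkeys(names, 0)
    for game in range(n_games):
        winner = min(names, key=lambda name: abs_errors[name][game])
        counts[winner] += 1
    return {name: counts[name] / n_games for name in names}

shares = nearest_share(abs_errors)
for name, errors in abs_errors.items():
    print(f"{name}: nearest {shares[name]:.0%}, mean APE {mean(errors)}, median APE {median(errors)}")
```

In this toy data Erratic is nearest twice as often as any other Predictor (40% of games versus 20% each), yet its mean APE of 24 points and median APE of 30 points are by far the worst of the four.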

(These two APE measures are a somewhat better indicator, however, of the likely relative performance of the Predictors in finishing in, say, the top few places, as evidenced by the data in the table above labelled '% nearest 3', which records the proportion of games in which a Predictor's prediction was amongst the three nearest to the actual margin. Median and Mean APE are therefore solid measures of a Predictor's general competence but not of its likelihood to excel.)

Instead then, perhaps metrics about how often a Predictor is very close to the actual margin might be better explanators of its overall performance. The table below shows the percentage of games in which the Predictor was within X points of the actual result, for various values of X.
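
For clarity about what each entry in such a table measures, here's a minimal Python sketch. The data are invented, and I've assumed that "within X points" includes a miss of exactly X points.

```python
# Hypothetical actual margins and one Predictor's predictions (not MAFL data).
actual = [6, -12, 30, 45, -3]
predicted = [7, -20, 30, 40, 10]  # absolute errors 1, 8, 0, 5, 13

def share_within(predicted, actual, threshold):
    """Share of games in which the prediction is within `threshold` points of the actual margin."""
    hits = sum(abs(p - a) <= threshold for p, a in zip(predicted, actual))
    return hits / len(actual)

for threshold in (1, 5, 10):
    share = share_within(predicted, actual, threshold)
    print(f"Within {threshold} points: {share:.0%}")
```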

Bookie_3's strong overall performance can be explained by its impressive set of numbers in this table. Whatever points threshold you deem as being "close" to the actual result, Bookie_3 is within the top handful of Predictors in terms of being this close.

The table also hints at why Combo_7 does so well head-to-head but rarely achieves the status of nearest Predictor to the actual margin. Combo_7 is as good as Bookie_3 and better than all other Predictors except HA3 in producing predictions that are within a single point of the actual margin, but it's poorer than almost every other Predictor in producing predictions that are otherwise very close. 

Those observations aside, the results in this table don't shed much light on Predictors' overall performance in being nearest the final margin. If you consider the rankings of the Predictors on any of the columns in the table above you get an ordering that's very different from the ordering based on how often each Predictor is nearest the actual margin. 


I'm left with the conclusion that, at least in the area of score prediction, a model's ability to be best is largely uncorrelated with its ability to be generally good. Put another way, a model that tends to finish in the top few places on some performance metric relatively often will not necessarily be one that also tends to finish top more often. 

Bookie_3 is the obvious exception here but, that model aside, you might as soon prefer to use Combo_7, HA3 or even HU3 depending on the specific nature of the modelling performance you required.

When you're building and selecting the best statistical model, it really matters what you mean by "best".