A Preference For Human Error

Randomness is - as Carl Sagan is often credited with first declaring - clumpy, and confirmation bias is seductive, so it might be entirely coincidental and partly self-fulfilling that I've lately been noticing common themes in the attitudes towards the statistical models that I create for different people and purposes.

A little context. Some of you will know that, as well as building statistical models to predict the outcome of AFL contests as a part-time (sic) hobby, I do similar things during the week, applying those same skills to problems faced by my corporate clients. These clients might, for example, want to identify customers more likely to behave in a specific way - say, to respond to a marketing campaign and open a new product - to classify customers as belonging to a particular group or "segment", or to talk to customers most at risk of ending their relationship with the client's organisation.

There are parallels between the processes used for modelling AFL results and those used for modelling consumer behaviours, and between the uses to which those models are put. Both involve taking historical data about what we know, summarising the key relationships in that data in the form of a model, and then using that model to make predictions about future behaviours or outcomes.

In football we seek to find relationships in teams' historical on-field performances - who it was that the teams played, where and when, and how they fared - and to generalise those relationships in the form of a model that can be applied to the same or similar teams in future games. With consumers, we take information about who they are, what products and services they use, and how they have used or stopped using those products and services in the past, generalise those relationships via a model, then use that model to predict how the same or similar consumers will behave in the future.

For both domains that seems a perfectly reasonable approach to take to prediction. Human beings do much the same thing to make their own predictions, albeit intuitively and usually not as systematically or thoroughly. We observe behaviours and outcomes, extract what we think are relevant and causal features of what we observe, generalise that learning and then apply it - sometimes erroneously - to what we perceive to be similar situations in the future. This is how both prejudice and useful generalisations about people and the world are formed. Even intuition, some suggest, is pattern-recognition rendered subconscious.

Now the statistical models we build with data aren't perfect - as George E P Box noted, "... all models are wrong, but some are useful" - but then neither are these heuristic "models" that we craft for ourselves to guide our own actions and reactions. 

Which brings me to my first observation. Simply put, errors made by humans and errors made by statistical models seem to be treated very differently. Conclusions reached by humans using whatever mental model or rule of thumb they employ are afforded much higher status than those reached by a statistical model, and errors in them are, accordingly, scrutinised more superficially and forgiven much more quickly, regardless of the relative empirical efficacy of the two approaches.

This is especially the case if the statistical model is of a type sometimes referred to as a "black box" - a model whose outputs can't be simply expressed in terms of some equation or rule involving the inputs. We humans seem particularly wary of the outputs of such models and impervious to evidence of their objectively superior performance. It's as if we can't appreciate that one cake can taste better than another without knowing all of the ingredients in both.

That's why, I'd suggest, I'll find resolute resistance from some client organisations to using the outputs of a model I've created to select high-probability customers to include in some program, campaign or intervention, simply because it's not possible to reverse-engineer why the model identified the customers that it did. Resistance levels will be higher still if the organisation already has an alternative - often unmeasured and untested - rule-of-thumb, historical, always-done-it-that-way-since-someone-decided-it-was-best basis for otherwise selecting customers. There's comfort in the ability to say why a particular customer was included in a program (or excluded from it), and that comfort can override any concern about whether greater efficacy might be achieved by choosing a better set of customers using some mysterious, and therefore untrustworthy, statistical model.

It can be devastating to a model's perceived credibility - and a coup for the program 'coroner', who'll almost certainly have a personal and professional stake in the outcome - when a few of the apparently "odd" selections don't behave in the manner predicted by the model. If a customer flagged by a model as being a high defection risk turns out to have recently tattooed the company logo on their bicep, another example of boffin madness is quickly added to corporate folklore. 

I find similar skepticism from a smaller audience about my football predictions. No-one thinks that a flesh-and-blood pundit will be capable of unerringly identifying the winners of sporting contests, but the same isn't true for those of us drawing on the expertise of a statistical model. In essence, the attitude of the skeptical few seems to be: "if you have all that data and all that sophisticated modelling, how can your predictions ever be incorrect, sometimes by quite a lot?" The subtext seems to be that if those predictions are capable of being spectacularly wrong on some occasions, then how can they be trusted at all? "Better to stick with my own opinions, however I've arrived at them", seems to be the internal monologue, "... at least I can come up with some plausible post-rationalisations for any errors in them".

But human opinions formed on some basis other than data are often at least as spectacularly wrong as those from any statistical model, and these too either aren't subjected to close post-hoc scrutiny at all, or are given liberal licence to be explained away via a series of "unforeseeable" in-game events.

Some skeptics are of the more extreme view that there are things in life that are inherently unpredictable using a statistical approach - in the sense of "incapable of being predicted" rather than just "difficult to predict". The outcome of football games is seen by some as being one such phenomenon. 

Still others misunderstand or are generally discomfited by the notion of probabilistic determinism. I can't say with certainty that, say, Geelong will win by exactly 12 points, but I can make probabilistic statements about particular results and the relative likelihood of them, for example claiming that a win by 10 to 20 points is more likely than one by 50 to 60 points.
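To make that concrete, here's a minimal sketch - with entirely assumed numbers - of how such statements can be quantified. It supposes the final margin is roughly normally distributed around the predicted 12-point margin with a standard deviation of 36 points; both figures are assumptions chosen for illustration, not outputs of any particular model.

```python
from scipy.stats import norm

# Illustrative only: assume the final margin is roughly normally
# distributed around the predicted 12-point margin, with an assumed
# standard deviation of 36 points.
margin = norm(loc=12, scale=36)

p_10_to_20 = margin.cdf(20) - margin.cdf(10)  # win by 10 to 20 points
p_50_to_60 = margin.cdf(60) - margin.cdf(50)  # win by 50 to 60 points

print(f"P(win by 10 to 20 points): {p_10_to_20:.2f}")  # about 0.11
print(f"P(win by 50 to 60 points): {p_50_to_60:.2f}")  # about 0.05
```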

With this view often comes a distrust of the idea that such forecasts are best judged in aggregate, on the basis of a number of such probabilistic assessments of similar events, rather than on an assessment-by-assessment basis. It's impossible to determine how biased or well-calibrated a probabilistic forecaster is from a single forecast, or even from just a few. You need a reasonably sized sample of forecasts to make such an assessment. If I told you that I was able to predict coin tosses and then made a single correct call, I'm sure you wouldn't feel that I'd made my case. But if I did it 20 times in a row you'd be compelled to entertain the notion (or to insist on providing your own coin).
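As a rough sketch of both points: twenty correct coin calls in a row is vanishingly unlikely under chance alone, and a forecaster's calibration only reveals itself once a reasonably large number of probability assessments are binned and the stated probabilities compared with observed frequencies. The simulation below uses made-up, deliberately well-calibrated forecasts purely to show the mechanics of that aggregate check.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single correct coin call is unremarkable: under pure chance it
# happens half the time. Twenty in a row is a different matter.
print(f"P(20 correct calls by chance): {0.5 ** 20:.1e}")  # roughly 1 in a million

# Judging forecasts in aggregate: simulate 1,000 probability forecasts
# that are well calibrated by construction, then compare stated
# probabilities with observed frequencies within bins.
forecast_probs = rng.uniform(0.05, 0.95, size=1000)
outcomes = rng.random(1000) < forecast_probs  # outcomes drawn to match the stated odds

bins = np.linspace(0, 1, 11)
bin_index = np.digitize(forecast_probs, bins) - 1
for b in range(10):
    in_bin = bin_index == b
    if in_bin.any():
        print(f"forecasts {bins[b]:.1f}-{bins[b + 1]:.1f}: "
              f"stated {forecast_probs[in_bin].mean():.2f}, "
              f"observed {outcomes[in_bin].mean():.2f}, "
              f"n = {in_bin.sum()}")
```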

None of this is meant to deride the forecasting abilities of people who make their predictions without building a formal statistical model - in some domains humans can be better forecasters than even the most complex model - but, instead, is an observation that the two approaches to prediction are often not evaluated on the same, objective basis. And they should be.

WHAT CAN FORECASTERS DO?

As modellers, how might we address this situation then? In broad terms I think we need to more effectively communicate with clients who aren't trained as statistical modellers and reduce the mystery associated with what we do.

With that in mind, here are some suggestions:

1. Better Explain the Modelling Process

In general, we need to better explain the process of model creation and performance measurement to non-technical consumers of our model outputs, and be willing to run well-designed experiments in which we compare the performance of our models with that of the internal experts, with a genuine desire to determine objectively which set of predictions is better.

"Black-box" models are inherently harder to explain - or, at least, their outputs are - but I don't think the solution is to abandon them, especially given that they're often more predictively capable than their more transparent ("white box"?) alternatives. Certainly we should be cautious that our black-box models aren't overfitting history, but we should be doing that anyway.

Usually, if I'm going to build a "black-box" model I'll build a simpler one as well, which allows me to compare and contrast their performance, discuss the model-building process using the simpler model, and then have a discussion to gauge the client's comfort level with the harder-to-explain predictions of the "black-box" model.

(Though the paper is probably a bit too complex for the non-technical client, it's apposite to reference this wonderful Breiman piece on Prediction vs Description here.)
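As a sketch of that side-by-side approach - using scikit-learn and synthetic data purely as stand-ins for a real customer dataset - I might fit a transparent logistic regression alongside a gradient-boosted "black-box" model and compare the two on the same holdout sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a customer-behaviour dataset; in practice this
# would be the client's own history.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# The transparent model: its coefficients can be shown and discussed.
simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The "black-box" model: often more predictive, but harder to explain.
black_box = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

for name, model in [("logistic regression", simple),
                    ("gradient boosting", black_box)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: holdout AUC = {auc:.3f}")
```

Whether any gap in holdout performance justifies the extra opacity is then a conversation to have with the client, rather than a decision made by default.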

2. Respect and Complement (... and compliment too, if you like) Existing Experts

We also need to be sensitive to the feelings of the people whose job it has been to be the company expert in the area. Sometimes, as I've said, human experts are better than statistical models; but often they're not, and there's something uniquely disquieting about discovering that your expertise, possibly gleaned over many years, can be trumped by a data scientist with a few lines of computer code and a modicum of data.

In some cases, however, the combined opinion of human and algorithm can be better than the opinion of either, taken alone (the example of Centaur Chess provides a fascinating case study of this). An expert in the field can also prevent a naive modeller from acting on a relationship that appears in the data but which is impractical or inappropriate to act on, or which is purely an artifact of organisational policies or practices. As a real-life example of this, a customer defection model I built once found that customers attached to a particular branch were very high defection risks. Time for some urgent remedial action with the branch staff then? No - it turned out that these customers had been assigned to that branch because they had loans that weren't being paid.

3. Find Ways of Quantitatively Expressing Uncertainty

It's also important to recognise and communicate the inherent randomness faced by all decision-makers, whether they act with or without the help of a statistical model, acknowledging the fundamental difficulties we humans seem to have with the notion of a probabilistic outcome. Football - and consumer behaviour - has a truly random element, sometimes a large one, but that doesn't mean we're unable to say anything useful about it at all, just that we're only able to make probabilistic statements about it.

We might not be able to say with complete certainty that customer A will do this or team X will win by 10 points, but we might be able to claim that customer A is 80% likely to perform some behaviour, or that we're 90% confident the victory margin will be in the 2 to 25 point range. 

Rather than constantly demanding fixed-point predictions - customer A will behave like this and customer B will not, or team X will win by 10 points - we'd do much better to ask for and understand how confident our forecaster is about his or her prediction, expressed, say, in the form of a "confidence interval".

We might then express our 10-point Home team win prediction as follows: "We think the most likely outcome is that the Home team will win by 10 points, and we're 60% confident that the final result will lie between a Home team loss by 10 points and a Home team win by 30 points". Similarly, in the consumer example, we might say that we think customer A has a 15% chance of behaving in the way described, while customer B has a 10% chance. So, in fact, the most likely outcome is that neither customer behaves in the predicted way, but if the economics stack up it makes more sense to talk to customer A than to customer B. Recognise, though, that there's an 8.5% chance (10% × 85%, assuming the two customers behave independently) that customer B will perform the behaviour and customer A will not.
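For anyone wanting to check the arithmetic behind those customer figures, here's a minimal sketch; the 8.5% only follows if we assume the two customers behave independently.

```python
# Hypothetical response probabilities from the model
p_a, p_b = 0.15, 0.10

# Assuming the two customers behave independently of one another
p_neither = (1 - p_a) * (1 - p_b)  # 0.765: the single most likely outcome
p_b_only = p_b * (1 - p_a)         # 0.085: B responds but A does not

print(f"Neither responds:       {p_neither:.1%}")
print(f"B responds, A does not: {p_b_only:.1%}")
```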

Range forecasts, confidence intervals and other ways of expressing our uncertainty don't make us sound as sure of ourselves as point-forecasts do, but they convey the reality that the outcome is uncertain.

4. Shift the Focus to Outcome rather than Process

Ultimately, the aim of any predictive exercise should be to produce demonstrably better predictions. The efficacy of a model can, to some extent, be demonstrated during the modelling process by the use of a holdout sample, but even that can never be a complete substitute for the in-use assessment of a model.

Ideally, before the modelling process commences - and certainly before its outputs are used - you should agree with the client on appropriate in-use performance metrics. Sometimes there'll be existing benchmarks for these metrics from similar previous activities, but a more compelling assessment of the model's efficacy can be constructed by supplementing the model's selections with selections based on the extant internal approach, using the organisation's expert opinions or current selection criteria. Over time this comparison might become unnecessary, but initially it can help to quantify and crystallise the benefits, if there are any, of a model-based approach over the status quo.
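A sketch of what that in-use comparison might look like appears below, with entirely made-up campaign numbers and response rate as the agreed metric; in practice the metric, the group sizes and the statistical test would all be settled with the client beforehand.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical in-use results after the campaign: how many customers in
# each group were contacted, and how many responded.
model_contacted, model_responded = 2000, 260  # model-selected customers
rule_contacted, rule_responded = 2000, 190    # rule-of-thumb selections

p_model = model_responded / model_contacted
p_rule = rule_responded / rule_contacted

# Simple two-proportion z-test on the difference in response rates.
p_pooled = (model_responded + rule_responded) / (model_contacted + rule_contacted)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / model_contacted + 1 / rule_contacted))
z = (p_model - p_rule) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"Model-selected response rate: {p_model:.1%}")
print(f"Rule-based response rate:     {p_rule:.1%}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```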

******

If we do all of this more often, maybe the double standard for evaluating human versus "machine" predictions will begin to fade. I doubt it'll ever disappear entirely though - I like humans better than statistical models too.