The numbers on the left relate to the p-values for how likely it was that we would observe a number of runs equal to or less than the number that we actually observed given the null hypothesis of random scoring, and the numbers on the right relate to the p-values for how likely it was that we would observe a number of runs equal to or greater than the number that we actually observed given the null hypothesis of random scoring.
What this table suggests is that, if there is momentum in AFL scoring patterns, it has only a very subtle influence. For starters, we have only 12 games that provide evidence against the null hypothesis at the 10% level, which is only 2 more games than we'd expect to find with p-values in this range due to chance. Even if we look at the number of games delivering a p-value under 50% we've only an excess of 8 games relative to chance.
In one, quite technical way, the runs test makes it hard to detect momentum because the observed number of runs is a discrete rather than a continuous statistic and therefore carries non-zero probability. (I expect that this would be less of an issue if we had a larger sample, but that's to be determined on another day.) One practical consequence of this is a complication in determining statistical significance. If, for example, under the null hypothesis, only 3% of runs values are less than the value we observed, but 88% are greater - because the exact number of runs we observed has a 9% probability under the null hypothesis - is this result statistically significant at the 10% level or not? The p-value for such a game is 12% and so would be recorded in the table above in the 10-20% bucket. Generally, the discrete characteristic of the runs statistic will tend to push the p-values into higher buckets.
Putting that to one side for a moment, there is a formal test that we can use on the set of p-values that we've observed to ask if they, as a group, support or impugn the null hypothesis. It's the Fisher Test, which is described here, and which uses the statistic -2 x sum of the natural logs of the p-values that is distributed under the null hypothesis as a chi-squared variable with 2k degrees of freedom where the number of independent p-values you have is k. In our case, for the p-values on the left-hand side of the table, the statistic is 211.9, which itself has a p-value of 27%. Not even the most null-hypothesis loathing researcher uses an alpha of 30% for his or her hypothesis testing.
We can rescue the possibility of scoring momentum somewhat by looking instead at the proportion of p-values that are less than 50%, treating this statistic as the outcome of a binomial process with constant probability 0.5, and determining whether we have a statistically significantly under- or over-representation of such p-values noting that, under the null hypothesis, we'd expect half of the p-values to be under 50% and half over 50%. With 58 of the 100 observed p-values coming in under 50% we get a p-value for this binomial test of 7%.
Finally, we can lend another sliver of support to the idea of momentum - in a slightly roundabout manner - by performing similar calculations with the p-values from the righthand side of the table above, which are p-values where the alternative hypothesis is that we've witnessed too many runs. The Fisher statistic for these data yield a p-value of 100% and the binomial on the number of p-values less than 50% is 99.9%, both of which are so supportive of the null hypothesis as to imply that we've maybe "chosen the wrong tail" to look at. We should note, however, that the same effect which tends to push the p-values higher for the left-tail test also pushes the p-values higher for the right-tail test, because in both cases we're including the probability associated with the actual observed number of runs in the p-value.
The Verdict on Scoring Momentum for Teams
In short, the evidence is that team scoring streaks are about what we'd expect them to be if momentum did not exist, though there might be some traces of momentum in a handful of games.
Perhaps the best way to put all of this complex statistical analysis in perspective is to look at the effect size of the phenomenon we're dealing with here and to note that the average difference across the 100 games in the sample between the observed and the expected number of runs under the null hypothesis is just 0.7 runs per game. When you consider that the average game has just over 24 scoring runs, that's a tiny if-at-all-existent difference.
What About Momentum in Scoring Type?
We can also ask of the scoring progression data whether or not there's evidence that goals tend to be followed by goals and behinds by behinds, regardless of which team scores them, or whether, instead, there's evidence that goals beget behinds and behinds beget goals - or whether there's no pattern at all to the sequence of scoring.
The following table was created in the same way as the previous table except this time, rather than looking at whether the Home or the Away team scored, we look at whether the score was a goal or a behind, regardless of which team scored it.