Only a few times in my professional career as a data scientist have I had the opportunity to use mathematical graph theory, but the technique has long fascinated me.
Briefly, the theory involves "nodes" (also called vertices), which are entities like books, teams or streets, and "edges", which signify relationships between the nodes - such as, in the books example, having the same author. Edges can denote present/absent relationships such as friendship, or they can carry a count (a weight), such as the number of times a pair of teams have played. Where a relationship holds between two nodes rather than running from one to the other (eg friendships), the edges are said to be undirected; where it flows from one node to another, they're said to be directed (eg Team A defeated Team B).
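To make those distinctions concrete, here is a minimal sketch in plain Python (not the R used for the analysis in this blog) of the three kinds of relationship just described, which graph theory usually calls edges. All names and results are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: every name and result below is invented.

# Undirected, present/absent: a friendship either exists or it doesn't.
# A frozenset ignores order, so (Ann, Bob) and (Bob, Ann) are the same edge.
friendships = {frozenset(("Ann", "Bob")), frozenset(("Bob", "Cy"))}

# Undirected, weighted: the number of times each pair of teams has played.
games_played = defaultdict(int)
fixtures = [("Team A", "Team B"), ("Team B", "Team A"), ("Team A", "Team C")]
for home, away in fixtures:
    games_played[frozenset((home, away))] += 1

# Directed: (winner, loser) is a different edge from (loser, winner).
defeats = {("Team A", "Team B"), ("Team C", "Team A")}

are_friends = frozenset(("Bob", "Ann")) in friendships  # order is irrelevant
a_beat_b = ("Team A", "Team B") in defeats
b_beat_a = ("Team B", "Team A") in defeats
```

Using an unordered pair for undirected relationships and an ordered pair for directed ones means the data structure itself enforces the distinction.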
In today's blog we'll use graph theory to depict the Twitter Follower networks of 30 Twitter accounts that I follow or am aware of, that have at least 200 Followers (as at 19 February 2017), and that Tweet regularly, if not exclusively, about the AFL competition. The nodes then will be Twitter accounts and the edges between them will be based on followership relationships. Strictly speaking followership edges are directed, but for the purposes of today's analysis I'll be ignoring that.
The objective of the analysis is to explore the nature of the follower groups for each of the 30 accounts, in particular the extent to which they are disjoint or shared, and to investigate whether some higher order structure might be revealed in the pattern of the sets of accounts that followers tend to follow in common. Well, that and to give me a chance to create some colourful and interesting charts using a technique I haven't got to use much ...
In broad terms, the analysis proceeds as follows:
- Extract the raw follower data using the rtweet package in R
- Use the igraph package to create an igraph network object containing the nodes and edges
- Use the cluster_spinglass function to cluster the nodes (ie followers) in the network
- Colour the network vertices on the basis of cluster membership
- Analyse the relationship between the clusters and the 30 Twitter accounts
- Use the Kamada-Kawai layout algorithm in igraph and the ggnet2 extension for ggplot2 to create one view of the network and the identified clusters
- Use the rgexf package to export the igraph object to a GEXF file that Gephi can open (thanks to Tim Bennett, Twitter handle @flashman, for putting me onto Gephi.)
- Use Gephi's Force Atlas 2 and Expansion layout tools to create a different view of the network
THE DATA - FOLLOWERSHIP
Altogether, just over 18,000 accounts follow at least one of the 30 selected accounts, with the individual follower counts for any single account ranging from just over 2,000 to around 200.
We can get an idea of the raw follower counts and the pairwise co-follower numbers by visualising the cross-tab of the counts as below.
(Please click on this, and on other images in this blog, to access larger versions.)
In this visualisation, larger dots denote a larger number of cross-followers (or, on the diagonal, followers).
We see, for example, that Arwon, BetDetective, and TheArc all have relatively large follower bases, and that the co-followings of TheArc and MoS, JoshPinn and FootyGospel, as well as InsightLane and FiguringFooty are all relatively large. (NB Links to the Twitter profiles of all 30 accounts used in the analysis appear at the bottom of this blog.)
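The cross-tab behind a chart like this is simply the size of each pairwise intersection of follower sets. Here is a minimal sketch in plain Python (the analysis itself was done in R), using made-up account names and follower IDs rather than the real data.

```python
from itertools import product

# Hypothetical follower sets keyed by account; all IDs are invented.
followers = {
    "AccountA": {1, 2, 3, 4, 5, 6},
    "AccountB": {4, 5, 6, 7},
    "AccountC": {8, 9},
}


def co_follower_counts(followers):
    """Return {(row, col): number of users who follow both accounts}.

    On the diagonal (row == col) this is just the account's follower count.
    """
    return {
        (row, col): len(followers[row] & followers[col])
        for row, col in product(followers, repeat=2)
    }


counts = co_follower_counts(followers)

# Distinct users who follow at least one of the accounts.
total_users = len(set().union(*followers.values()))
```

The table of counts is symmetric: the number of users following both AccountA and AccountB is the same whichever account you start from.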
To get a sense of how significant these cross-followings are in terms of the follower bases of the 30 accounts, we can convert these raw counts into proportions, which we depict in the visualisation below.
Here we denote proportions by dot size (larger is higher) and also colour (lighter blue is higher), with the light-blue dots along the diagonal representing a proportion of 100%. Specifically, the size (and colour) of a particular dot represents the proportion of the followers of the account named in the row who also follow the account named in the column.
So, for example, a relatively large proportion of the followers of plusSixOne also follow FiguringFooty.
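Unlike the raw counts, these proportions are not symmetric, because each one is divided by the follower count of the row account. A small Python sketch, again with invented accounts and IDs, shows the effect when one account is much larger than the other.

```python
# Hypothetical follower sets; all IDs are invented for illustration.
followers = {
    "Big": set(range(1, 11)),   # users 1..10
    "Small": {9, 10, 11, 12},   # shares users 9 and 10 with "Big"
}


def co_follow_proportion(followers, row, col):
    """Share of `row`'s followers who also follow `col` (1.0 on the diagonal).

    Assumes `row` has at least one follower, which holds in this blog's
    setting where every selected account has at least 200.
    """
    base = followers[row]
    return len(base & followers[col]) / len(base)


small_to_big = co_follow_proportion(followers, "Small", "Big")  # 2 of 4
big_to_small = co_follow_proportion(followers, "Big", "Small")  # 2 of 10
```

The same two shared followers are half of the smaller account's base but only a fifth of the larger one's, so the dot in the "Small" row is bigger than its mirror image in the "Big" row.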
If we're interested in the actual proportions, we can simply spit out the cross-tab and colour-code it by value (leaving out the 100s on the diagonal to allow for a slightly wider range of colours).
This view, I think, makes clearest of all the surprisingly low levels of cross-followership generally amongst these accounts given that they all have in common, at the very least, an interest in AFL. Even amongst accounts that are subjectively more similar in "content", such as TheArc and FiguringFooty, the co-following rates are only 65% (from FiguringFooty's viewpoint) or 23% (from TheArc's).
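The gap between those two rates is itself informative. Both have the same numerator (the shared followers), so dividing one by the other gives the ratio of the two follower bases. The quick Python calculation below uses only the two rounded percentages quoted above, so the result is approximate.

```python
# Co-following rates quoted above (rounded percentages).
share_of_figuringfooty = 0.65  # shared followers / FiguringFooty's followers
share_of_thearc = 0.23         # shared followers / TheArc's followers

# The shared-follower count cancels, leaving the ratio of the two bases:
# (shared / FiguringFooty) / (shared / TheArc) = TheArc / FiguringFooty
implied_base_ratio = share_of_figuringfooty / share_of_thearc

print(f"TheArc's base is roughly {implied_base_ratio:.1f}x FiguringFooty's")
```

In other words, the two rates imply that TheArc has a bit under three times as many followers as FiguringFooty.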
A NETWORK, CLUSTERED
Recall that we are, in graph theory terms, defining nodes as Twitter accounts and edges as the relationship "follows" (so that, for example, the node FMI will have an edge to it from node User_ID_12345 if the User_ID_12345 account follows FMI). Building our network on this basis, running it through our spinglass clustering routine, laying it out using the Kamada-Kawai algorithm, and then prettying it up in ggnet2, we obtain the visualisation below.
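Before any clustering or layout, the network is just a list of "follows" relationships (edges, in graph theory terms). The sketch below builds one in plain Python as an illustration only; it is not the igraph code used for this blog, and apart from FMI and the User_ID_12345 example above, the account names and IDs are invented.

```python
from collections import Counter

# Hypothetical data: the follower lists and "OtherAccount" are invented.
followers = {
    "FMI": ["User_ID_12345", "User_ID_67890"],
    "OtherAccount": ["User_ID_12345", "User_ID_11111"],
}

# One directed edge per "follows" relationship: (follower, followed account).
edges = [
    (user, account)
    for account, users in followers.items()
    for user in users
]

# Every distinct account, whether selected or a follower, becomes a node.
nodes = set(followers) | {user for user, _ in edges}

# Out-degree: how many of the selected accounts each follower follows.
out_degree = Counter(user for user, _ in edges)

# Followers attached to exactly one selected account.
exclusive = sorted(user for user, n in out_degree.items() if n == 1)
```

Followers with an out-degree of 1 are the ones a layout algorithm can push out to the edge of the picture next to the single account they follow, while followers with higher out-degree get pulled between accounts.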
Perhaps the most important part of network analysis is finding a layout that subjectively "works" for the data. Layout algorithms are responsible for moving the nodes around in an attempt to reveal the underlying relationships in the data, and the Kamada-Kawai algorithm seems to have done a reasonable job for us here, but igraph offers a number of alternative layout algorithms that we might also have tried. There is no such thing as an objectively "correct" layout of a network, but some layouts clearly work poorly for some networks.
Here we do have some objective sense of the efficacy of the layout algorithm, however, in that it has performed well in separating the nodes from many of the clusters defined by cluster_spinglass (igraph also offers other clustering, or 'community detection', algorithms) and in highlighting some of the more distinct co-follower groups.
For example, we can clearly see in red at the bottom of the visualisation the accounts that follow only ASpeedingCar, as well as the individual and shared followers of DownIsNewUp and BetDetective at the top of the visualisation.
In this visualisation, I've also coloured the node labels for the 30 selected accounts with the colour of the cluster to which the account belongs. As such, we can see the commonality of the RankSW, InsightLane, FiguringFooty, HPN, plusSixOne, MoS, SgtB and RyanB follower bases. As you'll see if you review their tweeting history, all of these accounts have a highly quantitative approach in their coverage and discussion of AFL.
A similar, though arguably prettier, version of the network emerges once we port the igraph network into Gephi and use some of its layout algorithms as described earlier.
(You can access a PDF version of this image here.)
ANALYSING THE CLUSTERS
I gave the spinglass algorithm licence to create up to 50 clusters (or communities), but it stopped after building just 19 of them.
Consistent with my earlier comment about the relatively disjoint nature of the follower bases, many of the 30 selected Twitter accounts see a large proportion of their followers coming from a single community.
TheArc, for example, has a large proportion of its follower base in Community 9. No other account sources any significant proportion of its base from this Community. Most accounts, in fact, can be said to exclusively "own" a particular community, the obvious exception being the 10 accounts that seem to "share" Community 8.
This notion of "ownership" is also revealed via a by-community analysis looking at the proportion of each community that follows a particular account.
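Both views of "ownership" come from the same two inputs: the follower set of each account and a community label for each follower. The Python sketch below is a hedged illustration with invented accounts, followers and labels; in the actual analysis the labels come from the spinglass clustering.

```python
from collections import Counter

# Hypothetical inputs: follower sets per account and one community label
# per follower. All values are invented for illustration.
followers = {
    "AccountA": {"u1", "u2", "u3", "u4"},
    "AccountB": {"u4", "u5"},
}
community = {"u1": 9, "u2": 9, "u3": 9, "u4": 8, "u5": 8}


def account_profile(account):
    """Proportion of an account's followers that falls in each community."""
    tally = Counter(community[user] for user in followers[account])
    return {label: n / len(followers[account]) for label, n in tally.items()}


def community_profile(label):
    """Proportion of a community's members that follows each account."""
    members = {user for user, c in community.items() if c == label}
    return {
        account: len(members & users) / len(members)
        for account, users in followers.items()
    }


a_by_community = account_profile("AccountA")  # mostly community 9
community_8 = community_profile(8)            # followed by both accounts
community_9 = community_profile(9)            # followed only by AccountA
```

In this toy data AccountA "owns" community 9 in both senses: three quarters of its followers sit there, and no member of that community follows AccountB. Community 8, by contrast, is shared.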
SUMMARY AND WHAT NEXT?
Limited in scope as it is, I think this analysis shows promise for wider application. I might, for example, redo it with a larger set of accounts or start to investigate the wider Twitter behaviour of some of the followers in the identified communities.
More broadly, I think there are other interesting possibilities in applying network analysis and graph theory to aspects of sports analytics such as team-versus-team result histories in the home-and-away season, or just in finals. I plan to investigate some of these over the next few weeks and during the season.
THE 30 ACCOUNTS
Below are links to the Twitter profiles of the 30 accounts used in this analysis in case you'd like to check a few of them out.