This is the first post of a series on network visualisation.
Thanks to easier access to network analysis tools and a growing interest across many disciplines in studying the relations that structure datasets, networks have become ubiquitous objects in science, in newspapers, on tech book covers, all over the Web, and as illustrations for anything big data-related (hand in hand with word clouds). Unfortunately, the recourse to networks has reached a point where, at a conference, I heard a speaker say:
“Since this is mandatory, here is a network visualisation of these data. Sorry if you cannot see anything in this big hairball.”
You would expect everything presented at a conference to have a purpose. Sadly, it seems that there is underlying pressure in scientific communities to create such horrors.
A network is easy to create, easy to draw, easy to export, and usually nobody asks questions, because networks are often difficult to grasp. This could be different.
Question the relevance
Frankly, who isn’t bored by network visualisations appearing in talks or peer-reviewed journals where you cannot “read” anything? By slides that do not generate questions but bring the discussion to a close?
To put it more clearly: when a network has more than, say, thirty nodes((That’s low, and it is worse in cheaply printed journals/books or on slides projected far from an audience.)), it is often difficult to find answers to legitimate questions like:
- What is the structure under that layer of edges darkening everything? Am I allowed to draw any conclusion from that figure? (Spoiler)
- For any given node, which nodes is it connected to?
- Which layout algorithm was used to position the nodes? Are the nodes in the middle of the figure also the most central nodes in the network?
- What/where are the labels? (Since most of the time they are missing, unreadable or overlapping.)
- Curved edges, really?
- What do node colours represent((Why did the author use 1990s pixellated spheres instead of filled in circles? Note that these artefacts from the past still appear from time to time.))? Which community detection algorithm was selected and what is the modularity score?
Answers to these questions must accompany the figure((Works that omit their methods, for example by not explaining how relations are defined, deserve criticism as well.)), either orally (in a talk), in the legend, or simply by being integrated into the design. And when published, the network figure should be readable on its own.
Nevertheless, sometimes it is simply impossible to achieve this: because of space, because of printing quality limitations, because of the size of the network… Those are hints that perhaps we should not include that motley saturated network “visualisation” in our paper.
The aim of network modelling is to render the structural features of a group of objects (actors, words, cells, places, etc.) and to allow their study. This means putting emphasis on the relations between the entities of that group. If the network is large, such an analysis clearly does not require any visualisation((I plead guilty.)), only network metrics, perhaps simulations, etc. For the resulting paper, this means displaying tables and diagrams of micro-structures instead.
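To make the "metrics instead of a picture" idea concrete, here is a minimal sketch in plain Python that summarises a network as numbers ready to go into a table. The edge list is a hypothetical toy example, the function names are my own, and the code assumes a simple undirected network (no duplicate edges, no self-loops). A real analysis would likely rely on a dedicated network library.

```python
from collections import Counter

# Hypothetical toy edge list (undirected); replace with your own data.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]


def degrees(edge_list):
    """Return a dict mapping each node to its number of neighbours."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def density(edge_list):
    """Share of possible undirected links that actually exist."""
    n = len(degrees(edge_list))
    return 2 * len(edge_list) / (n * (n - 1)) if n > 1 else 0.0


deg = degrees(edges)
# Degree distribution: degree value -> number of nodes with that degree.
degree_distribution = Counter(deg.values())

print(f"nodes: {len(deg)}, edges: {len(edges)}, density: {density(edges):.2f}")
for k in sorted(degree_distribution):
    print(f"degree {k}: {degree_distribution[k]} node(s)")
```

A short table of this kind (size, density, degree distribution) often tells the reader more about a large network than a hairball would.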
Here are a few reasons that may tempt one to include an irrelevant network visualisation:
- When wanting to suggest the size or density of a network.
- When unsure about metrics, or unwilling to use any.
- When building the network is the result and not a step.
- Because there was a way to interpret the data set as relational…
Since a burst at the very end of the 20th century((Commonly attributed in state-of-the-art sections to Watts and Strogatz (1998) and Barabási and Albert (1999).)), there has been an ever-growing passion for networks, which is a great thing for methodological and interdisciplinary reasons. Software and hardware have become more reliable, and the entry cost keeps falling. Networks have become widespread, and that is a great thing. However, all this has come at a cost, and I believe that the scientific community needs to maintain high standards by questioning the relevance of network visualisations, as they may lower the level of debate rather than raise it.
A few suggestions to address the criticisms above:
- Visualising large networks may provide insight, but should remain an intermediary result. Include them only if they add to the understanding.
- If the network is too dense but not too large, increase the minimal link distance, reduce the node size, and move the node labels away.
- If the network is too dense and/or too large, compute a subnetwork based on an edge weight threshold((See for example Serrano et al. (2009).)), or contract the network to densely connected components and mention the method you used.
- Explain which layout algorithm is used. This helps to interpret the visualisation (wherever possible). Use a layout algorithm that keeps the variance of the link distance distribution low while being efficient at minimising the number of edge crossings((Let’s start by banishing circular layouts.)).
- Draw edges straight. Draw arcs curved only if there are reciprocal arcs((See also this short discussion by Elijah Meeks and his whole presentation.)).
- Have legends explaining the size of nodes and the width of edges.
- Have you considered interactive networks?
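As an illustration of the subnetwork suggestion in the list above, here is a minimal sketch of the simplest possible filter: a single global cut-off on edge weight. This is not the method of Serrano et al. (2009), which is more refined. The weighted edge list, the function name and the threshold value are all hypothetical.

```python
# Hypothetical weighted edge list: (source, target, weight).
weighted_edges = [
    ("a", "b", 5.0),
    ("a", "c", 1.0),
    ("b", "c", 4.0),
    ("c", "d", 0.5),
    ("d", "e", 3.0),
]


def threshold_subnetwork(edge_list, min_weight):
    """Keep edges with weight >= min_weight and the nodes they connect."""
    kept = [(u, v, w) for u, v, w in edge_list if w >= min_weight]
    nodes = sorted({n for u, v, _ in kept for n in (u, v)})
    return nodes, kept


nodes, kept = threshold_subnetwork(weighted_edges, min_weight=3.0)
# Report the threshold alongside the figure so readers know what was dropped.
print(f"kept {len(kept)} of {len(weighted_edges)} edges (weight >= 3.0)")
```

Whatever filter is chosen, the threshold and the share of edges removed belong in the legend or the methods, since a global cut-off can disconnect nodes whose links are all weak.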
I hope this post will prove helpful. I’ll be pleased to hear comments and suggestions.
In the next episode…
… I will discuss interactivity and layout algorithms. Moreover, I will provide a tutorial to create networks like the tiny one below((Made with the networkD3 R package.)) (please, click on it!). Interactive networks allow the reader to bypass obstacles and thus to solve many problems.
Update (Nov. 11, 2015). I’ve been reminded of this post introducing hive plots (thanks Ioannis). Hive plots are an attempt to visualise large networks with suitable node attributes. In particular, the authors’ criticisms of large network visualisations are absolutely relevant (and come with creepy examples). While in many cases I do not believe that a visualisation is necessary, interactive versions of hive plots are promising. See also this discussion. By the way, this is also the case with interactive network visualisations.