Producing a Sankey diagram with the networkD3 package

A Sankey diagram is an effective way of displaying flows: it shows how a quantity moves from its origin to its final destination, with the width of each band proportional to the amount flowing through it.

A famous example of this kind of diagram is Charles Minard's 1869 chart, which shows the number of men in Napoleon's 1812 Russian campaign army, their movements, and the temperatures they encountered on the return path.

In a Sankey diagram, a given amount is shown on the leftmost side of the plot and, moving to the right (which can be interpreted as the flow of time), this amount is split into parts or simply reduced. The latter is the case in Minard's diagram, where the band narrows as soldiers die during the campaign, while a separate line plot at the bottom tracks the temperature during the retreat.

Getting ready

In order to get started with this recipe, you will need to install and load the networkD3 and jsonlite packages:

install.packages(c("networkD3","jsonlite"))
library(networkD3)
library(jsonlite)

The first package implements the sankeyNetwork() function that we will use in this recipe, while the second is only required to parse the dataset from JSON into R data frames.

Our example shows energy flows from production to final usage or waste, using the dataset provided by Christopher Gandrud, creator of the networkD3 package.

To make this dataset available, we first need to download it and convert it from JSON to an ordinary R list:

  1. Define a URL object pointing to the data source:
    URL <- paste0("https://cdn.rawgit.com/christophergandrud/networkD3/",
      "master/JSONdata/energy.json")
  2. Define an Energy object where we can store the data from the defined source:
    Energy <- jsonlite::fromJSON(URL)

This Energy list is composed of two data frames: one for nodes (that is, vertices) and one for links (that is, edges):

List of 2
 $ nodes:'data.frame': 48 obs. of  1 variable:
  ..$ name: chr [1:48] "Agricultural 'waste'" "Bio-conversion" "Liquid" "Losses" ...
 $ links:'data.frame': 68 obs. of  3 variables:
  ..$ source: int [1:68] 0 1 1 1 1 6 7 8 10 9 ...
  ..$ target: int [1:68] 1 2 3 4 5 2 4 9 9 4 ...
  ..$ value : num [1:68] 124.729 0.597 26.862 280.322 81.144 …

The latter data frame is a list of weighted edges: each row gives a starting node (source) and an end node (target), and the link between them is weighted by the value attribute.
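To see why fromJSON() yields this shape without depending on the remote file, here is a minimal sketch that parses a tiny JSON string laid out like energy.json. The node names and values are made up for illustration and are not taken from the real dataset:

```r
library(jsonlite)

# A tiny, made-up JSON document with the same layout as energy.json:
# one array of node objects and one array of link objects.
json_text <- '{
  "nodes": [{"name": "Coal"}, {"name": "Electricity"}, {"name": "Losses"}],
  "links": [
    {"source": 0, "target": 1, "value": 10.5},
    {"source": 1, "target": 2, "value": 4.2}
  ]
}'

# fromJSON() accepts a JSON string as well as a URL; by default it
# simplifies each array of objects into a data frame.
toy <- jsonlite::fromJSON(json_text)

# Inspect the result: a list holding a nodes and a links data frame
str(toy)
```

Running str() on the Energy object from the previous step should give the same kind of summary, only with more rows.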

If you want to apply this recipe to your own data, which I hope you do, you should arrange it in two distinct data frames with the following structure:

  • The nodes data frame:
    • Nodes: This is a vector of all your nodes' names, with no duplications
  • The links data frame:
    • From: This is a numeric column showing the first node of every connection in your diagram. For example, to introduce a connection between the first and the third nodes defined in Nodes, you should write 0 here (and 2 in the To column); be aware that indexing is zero-based, so the first node has value 0 and not 1
    • To: This is a numeric column showing the end of every connection in your diagram
    • Weight: This is the value of your connection, meaning how much of your flow passes through this connection

It may be useful to note that the second data frame is properly called an edge list, where each observation represents an edge of your network.
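If your flows are recorded by node name rather than by index, the two data frames described above can be derived with base R alone. The following is a minimal sketch; the flows data frame and its column names (from, to, weight) are hypothetical and stand in for your own data:

```r
# Hypothetical flows, described by node names rather than indices
flows <- data.frame(
  from   = c("Coal", "Gas", "Electricity", "Electricity"),
  to     = c("Electricity", "Electricity", "Homes", "Losses"),
  weight = c(40, 25, 50, 15),
  stringsAsFactors = FALSE
)

# Nodes: every distinct name, in order of first appearance, no duplicates
nodes <- data.frame(
  name = unique(c(flows$from, flows$to)),
  stringsAsFactors = FALSE
)

# Links: match() returns 1-based positions, so subtract 1 to obtain
# the zero-based indices that the links data frame must contain
links <- data.frame(
  source = match(flows$from, nodes$name) - 1,
  target = match(flows$to, nodes$name) - 1,
  value  = flows$weight
)

print(nodes)
print(links)
```

The column names are free to choose, because sankeyNetwork() receives them through its Source, Target, Value, and NodeID arguments; with the names used in this sketch, those would be "source", "target", "value", and "name".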

How to do it...

  1. Produce the Sankey diagram, as follows:
    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
      Target = "target", Value = "value", NodeID = "name",
      units = "TWh", fontSize = 12, nodeWidth = 30)

    This will result in the following Sankey diagram:

  2. We will now adjust the font size by changing the value of the fontSize parameter:
    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
      Target = "target", Value = "value", NodeID = "name",
      units = "TWh", fontSize = 10, nodeWidth = 30)
  3. Next, we change nodeWidth:
    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
      Target = "target", Value = "value", NodeID = "name",
      units = "TWh", fontSize = 12, nodeWidth = 5)
  4. In order to embed your Sankey diagram, you can leverage the RStudio Save as Web Page control from the Export menu:

This control will let you save your diagram as an HTML file.
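If you prefer to save the diagram from code rather than through the RStudio menu, the networkD3 package also provides a saveNetwork() function. The sketch below uses a made-up three-node dataset and writes to a temporary directory; both are assumptions chosen for illustration, so substitute your own data and path:

```r
library(networkD3)

# Toy data, made up for illustration (zero-based node indices)
nodes <- data.frame(name = c("A", "B", "C"))
links <- data.frame(source = c(0, 0), target = c(1, 2), value = c(3, 1))

# Store the widget in an object instead of printing it to the Viewer
diagram <- sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
  Target = "target", Value = "value", NodeID = "name",
  fontSize = 12, nodeWidth = 30)

# Write the widget to an HTML file. With selfcontained = FALSE the
# JavaScript dependencies are placed in a folder next to the file.
out_file <- file.path(tempdir(), "sankey.html")
saveNetwork(diagram, file = out_file, selfcontained = FALSE)
```

Setting selfcontained = TRUE should bundle everything into a single HTML file instead, but as far as I know this relies on pandoc being available on your system.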

How it works...

In step 1, we call the sankeyNetwork() function, which produces an interactive Sankey diagram in your RStudio Viewer pane, where the node alignment can be customized and flows can be highlighted by clicking on them. Steps 2 and 3 redraw the same diagram with a smaller label font (fontSize) and narrower nodes (nodeWidth).

In step 4, we save the Sankey diagram as a web page, which lets you embed it in websites while preserving its interactive features.
