Creating a Sankey Diagram using the networkD3 package in R

Question

Currently I am trying to create an interactive Sankey diagram with the networkD3 package, following the instructions by Chris Gandrud (https://christophergandrud.github.io/networkD3/).
What I don't understand is the table format, since he uses just two columns to visualise multiple transitions. To be more specific, I have a dataset with four columns that represent four years. Each column contains hotel names, and each row represents one customer who is "tracked" over these four years.

    library(networkD3)  # provides sankeyNetwork()

    URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
    Energy <- jsonlite::fromJSON(URL)

    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
         Target = "target", Value = "value", NodeID = "name",
         units = "TWh", fontSize = 12, nodeWidth = 30)

To give you an overview of my data, here is a screenshot: [screenshot not preserved]

I would give you more "coded" information, but since I am very new to R, I hope you can follow my train of thought on this problem. If not, please do not hesitate to ask.

Thank you :)

Answer 1:

You need two data frames: one listing all nodes (containing the names) and one listing the links. The latter contains three columns: the source node, the target node, and a value indicating the strength or width of the link. In the links data frame you refer to the nodes by their (zero-based) position in the nodes data frame.
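
As a minimal sketch of that layout (the hotel names and counts below are made up for illustration and are not taken from the question), the two data frames could look like this:

```r
# Hypothetical toy data: three nodes and two links between them.
nodes <- data.frame(name = c("Hotel1", "Hotel2", "Hotel3"),
                    stringsAsFactors = FALSE)

# source/target are zero-based row positions in `nodes`:
# 0 = Hotel1, 1 = Hotel2, 2 = Hotel3
links <- data.frame(source = c(0, 0),
                    target = c(1, 2),
                    value  = c(10, 5))

# Look up the names behind each link (add 1 because R indexes from 1).
link_names <- data.frame(from = nodes$name[links$source + 1],
                         to   = nodes$name[links$target + 1],
                         stringsAsFactors = FALSE)
print(link_names)
```

Passing these to sankeyNetwork(Links = links, Nodes = nodes, Source = "source", Target = "target", Value = "value", NodeID = "name") should draw Hotel1 splitting into Hotel2 and Hotel3.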

Assuming your data looks like:

df <- data.frame(Year1=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year2=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year3=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year4=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

For the diagram you need to differentiate not only between the hotels but between the hotel/year combinations, since each combination should be its own node:

df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)

The links are the "transitions" between the hotels from one year to the next:

library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())

colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")

links <- rbind(as.data.frame(trans1_2), 
               as.data.frame(trans2_3), 
               as.data.frame(trans3_4))

Finally, the node names in the links data frame need to be replaced by their zero-based positions in the nodes data frame:

nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

Then the diagram can be drawn:

library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "sum", NodeID = "name",
              fontSize = 12, nodeWidth = 30)

There might be more elegant solutions, but this could be a starting point for your problem. If you don't like the "Year..." in the nodes' names, you can remove them after setting up the data frames.
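
One way to drop the prefix, sketched here on a small hypothetical nodes data frame (the regular expression assumes the "YearN_" naming used above), is to rewrite the labels only once the links have been converted to indices:

```r
# Hypothetical node names in the "YearN_HotelM" form built above.
nodes <- data.frame(name = c("Year1_Hotel1", "Year1_Hotel2", "Year2_Hotel1"),
                    stringsAsFactors = FALSE)

# Strip the leading "Year<digits>_" prefix. Do this only after
# links$source and links$target have been turned into indices,
# because the labels are no longer unique afterwards.
nodes$name <- sub("^Year[0-9]+_", "", nodes$name)
print(nodes$name)
```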

Answer 2:

This question comes up a lot... how to convert a dataset that has multiple links/edges defined on each row across several columns. Here's how I convert that into the type of dataset that sankeyNetwork (and many other packages that deal with edges/links/network data) uses... a dataset with one edge/link per row.

Starting with an example dataset...

df <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2,year3,year4
Bob,Hilton,Sheraton,Westin,Hyatt
John,Four Seasons,Ritz-Carlton,Westin,Sheraton
Tom,Ritz-Carlton,Westin,Sheraton,Hyatt
Mary,Westin,Sheraton,Four Seasons,Ritz-Carlton
Sue,Hyatt,Ritz-Carlton,Hilton,Sheraton
Barb,Hilton,Sheraton,Ritz-Carlton,Four Seasons
')

#   name        year1        year2        year3        year4
# 1  Bob       Hilton     Sheraton       Westin        Hyatt
# 2 John Four Seasons Ritz-Carlton       Westin     Sheraton
# 3  Tom Ritz-Carlton       Westin     Sheraton        Hyatt
# 4 Mary       Westin     Sheraton Four Seasons Ritz-Carlton
# 5  Sue        Hyatt Ritz-Carlton       Hilton     Sheraton
# 6 Barb       Hilton     Sheraton Ritz-Carlton Four Seasons

1. Create a row number so that you'll still be able to determine which row/observation each individual link came from when you convert the data to long format.
2. Use tidyr's gather() function to convert the dataset to long format.
3. Convert the column name variable to the index/number of the column in the original dataset.
4. Grouped by row (each observation in the original dataset), order each node by the column it was in, and create a variable for its "target" by setting it to the node from the column after it.
5. Filter out any rows that have NA for "target" (nodes in the last column of the original dataset will not have a "target", and therefore those rows do not specify a link).

library(dplyr)
library(tidyr)

links <-
  df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))

# # A tibble: 24 x 4
#      row column source       target
#    <int>  <int> <chr>        <chr>
#  1     1      1 Bob          Hilton
#  2     2      1 John         Four Seasons
#  3     3      1 Tom          Ritz-Carlton
#  4     4      1 Mary         Westin
#  5     5      1 Sue          Hyatt
#  6     6      1 Barb         Hilton
#  7     1      2 Hilton       Sheraton
#  8     2      2 Four Seasons Ritz-Carlton
#  9     3      2 Ritz-Carlton Westin
# 10     4      2 Westin       Sheraton
# # ... with 14 more rows
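
For readers without dplyr/tidyr, the same pairing of neighbouring columns can be sketched in base R. This is only an illustration under stated assumptions: `visits` and `links_base` are hypothetical names, the data is a made-up three-customer table, and the name column is left out.

```r
# Hypothetical three-customer example, one column per year.
visits <- data.frame(year1 = c("Hilton", "Westin", "Hilton"),
                     year2 = c("Sheraton", "Hilton", "Sheraton"),
                     year3 = c("Westin", "Hyatt", "Hyatt"),
                     stringsAsFactors = FALSE)

# For each pair of neighbouring columns (1-2, 2-3, ...), emit one link per row.
pairs <- lapply(seq_len(ncol(visits) - 1), function(i) {
  data.frame(row    = seq_len(nrow(visits)),
             column = i,
             source = visits[[i]],
             target = visits[[i + 1]],
             stringsAsFactors = FALSE)
})
links_base <- do.call(rbind, pairs)

# Drop rows where either end is missing (no link to draw).
links_base <- links_base[!is.na(links_base$source) & !is.na(links_base$target), ]
print(links_base)
```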

Now the data is in the typical network data format of one link per row defined by "source" and "target" columns, and it could be used with sankeyNetwork(). However, you will likely want nodes that refer to the same thing to appear multiple times within your plot... if someone visited the Hilton in year 1, and then visited the Hilton again in year 3, you will probably want 2 separate nodes, both named Hilton, but appearing in different parts of the plot. In order to do that, you will have to identify each node in your "source" and "target" columns with the year in which it was visited. That's where keeping the "row" and "column" variables around will come in handy.

Append the column index to the "source" name, and append the column index + 1 to the "target" name, and now you will be able to distinguish, for instance, between the node for Hilton which was visited in year 1 and the node for Hilton that was visited in year 3:

links <-
  links %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

# # A tibble: 24 x 2
#    source         target
#    <chr>          <chr>
#  1 Bob_1          Hilton_2
#  2 John_1         Four Seasons_2
#  3 Tom_1          Ritz-Carlton_2
#  4 Mary_1         Westin_2
#  5 Sue_1          Hyatt_2
#  6 Barb_1         Hilton_2
#  7 Hilton_2       Sheraton_3
#  8 Four Seasons_2 Ritz-Carlton_3
#  9 Ritz-Carlton_2 Westin_3
# 10 Westin_2       Sheraton_3
# # ... with 14 more rows

Now you can follow the rather standard procedure for using a source-target list of links to build the necessary data frames for sankeyNetwork(). Create a nodes data frame with all the unique nodes found in the "source" and "target" vectors. Convert the "source" and "target" vectors in the links data frame to be the 0-based-index of the node in the nodes data frame. Add an arbitrary value for each link in the links data frame since it's required by sankeyNetwork(). Now you can remove the appended column index from the node names in the nodes data frame because they will only be used to label the nodes in the resulting plot (so it no longer matters if they are unique). Then plot it with sankeyNetwork()!

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
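
Because every row above becomes its own link with value 1, customers who made the same move produce repeated source/target pairs. If you would rather have one link per pair whose width is the number of customers (as in the first answer), a base R sketch on a small hypothetical links data frame might be:

```r
# Hypothetical links with one row per customer move (already zero-based).
links_raw <- data.frame(source = c(0, 0, 1, 0),
                        target = c(2, 2, 2, 3),
                        value  = 1)

# Sum the values for each distinct source/target pair.
links_agg <- aggregate(value ~ source + target, data = links_raw, FUN = sum)
print(links_agg)
```

links_agg can then be passed as Links with Value = 'value' in the same sankeyNetwork() call.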

Source: https://stackoverflow.com/questions/44132423/creating-a-sankey-diagram-using-networkd3-package-in-r

Tags: plot, sankey-diagram, htmlwidgets, networkd3