Question
I am currently trying to create an interactive Sankey diagram with the networkD3 package, following the instructions by Christopher Gandrud (https://christophergandrud.github.io/networkD3/).
What I don't understand is the table format, since his example uses just two columns to visualise multiple transitions. To be more specific, I have a dataset with four columns that represent four years. Each column contains hotel names, and each row represents one customer who is "tracked" over these four years.
library(networkD3)
URL <- paste0(
  "https://cdn.rawgit.com/christophergandrud/networkD3/",
  "master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              units = "TWh", fontSize = 12, nodeWidth = 30)
To give you an overview of my data here is a screenshot:
I would give you more "coded" information, but since I am very new to R, I hope you can follow my train of thought on this problem. If not, please do not hesitate to ask.
Thank you :)
Answer 1:
You need two data frames: one listing all nodes (containing the names) and one listing the links. The latter contains three columns: the source node, the target node, and a value indicating the strength or width of the link. In the links data frame you refer to the nodes by their (zero-based) position in the nodes data frame.
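As a minimal sketch of that layout (the node names A, B, C and the values are made up for illustration), the links can be written with node names first and then converted to zero-based positions:

```r
# Hypothetical toy data: three nodes and two links between them
nodes <- data.frame(name = c("A", "B", "C"), stringsAsFactors = FALSE)

# Links written with node names first, which is easier to read ...
links <- data.frame(source = c("A", "B"),
                    target = c("B", "C"),
                    value  = c(10, 4),
                    stringsAsFactors = FALSE)

# ... then converted to zero-based row positions in `nodes`
# (match() is one-based, hence the "- 1")
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
```

After this step, links$source and links$target hold numbers rather than names, which is the form the Source and Target arguments of sankeyNetwork() expect.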
Assuming your data looks like this:
df <- data.frame(Year1 = sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year2 = sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year3 = sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year4 = sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)
For the diagram you need to differentiate not only between the hotels but also between the hotel/year combinations, since each combination should be its own node:
df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)
The links are the "transitions" between the hotels from one year to the next:
library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())
colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")
links <- rbind(as.data.frame(trans1_2),
               as.data.frame(trans2_3),
               as.data.frame(trans3_4))
Finally, the two data frames need to be linked: the node names in links are replaced by their zero-based positions in nodes:
nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
Then the diagram can be drawn:
library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "sum", NodeID = "name",
              fontSize = 12, nodeWidth = 30)
There might be more elegant solutions, but this could be a starting point for your problem. If you don't like the "Year..." in the nodes' names, you can remove it after setting up the data frames.
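One way to strip that prefix, sketched here on a few made-up node names in the "YearN_Hotel" form used above, is a regular expression with sub(). This assumes links$source and links$target have already been converted to indices, because the names are no longer unique afterwards:

```r
# Hypothetical node names in the "YearN_Hotel" form built above
nodes <- data.frame(name = c("Year1_Hotel1", "Year2_Hotel1", "Year4_Hotel3"),
                    stringsAsFactors = FALSE)

# Drop the leading "YearN_" prefix so only the hotel name is shown as the label
nodes$name <- sub("^Year[0-9]+_", "", nodes$name)
```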
Answer 2:
This question comes up a lot: how to convert a dataset that has multiple links/edges defined on each row across several columns. Here's how I convert that into the type of dataset that sankeyNetwork() (and many other packages that deal with edges/links/network data) uses: a dataset with one edge/link per row.
Starting with an example dataset:
df <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2,year3,year4
Bob,Hilton,Sheraton,Westin,Hyatt
John,Four Seasons,Ritz-Carlton,Westin,Sheraton
Tom,Ritz-Carlton,Westin,Sheraton,Hyatt
Mary,Westin,Sheraton,Four Seasons,Ritz-Carlton
Sue,Hyatt,Ritz-Carlton,Hilton,Sheraton
Barb,Hilton,Sheraton,Ritz-Carlton,Four Seasons
')
# name year1 year2 year3 year4
# 1 Bob Hilton Sheraton Westin Hyatt
# 2 John Four Seasons Ritz-Carlton Westin Sheraton
# 3 Tom Ritz-Carlton Westin Sheraton Hyatt
# 4 Mary Westin Sheraton Four Seasons Ritz-Carlton
# 5 Sue Hyatt Ritz-Carlton Hilton Sheraton
# 6 Barb Hilton Sheraton Ritz-Carlton Four Seasons
- create a row number so that you'll still be able to determine which row/observation each individual link came from when you convert the data to long format
- use tidyr's gather() function to convert the dataset to long format
- convert the column name variable to the index/number of the column in the original dataset
- grouped by row (each observation in the original dataset), order each node by the column it was in, and create a variable for its "target" by setting it to the node from the column after it
- filter out any rows that have NA for "target" (nodes in the last column of the original dataset will not have a "target", and therefore those rows do not specify a link)
library(dplyr)
library(tidyr)
links <-
  df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))
# # A tibble: 24 x 4
# row column source target
# <int> <int> <chr> <chr>
# 1 1 1 Bob Hilton
# 2 2 1 John Four Seasons
# 3 3 1 Tom Ritz-Carlton
# 4 4 1 Mary Westin
# 5 5 1 Sue Hyatt
# 6 6 1 Barb Hilton
# 7 1 2 Hilton Sheraton
# 8 2 2 Four Seasons Ritz-Carlton
# 9 3 2 Ritz-Carlton Westin
# 10 4 2 Westin Sheraton
# # ... with 14 more rows
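The "target" column in the pipeline above comes from dplyr's lead(), which shifts a vector by one position so that each element is paired with the one that follows it. A minimal sketch on a made-up vector of one customer's stays (the name `stays` is hypothetical):

```r
library(dplyr)

# Hypothetical sequence of hotels for a single customer, in column order
stays <- c("Hilton", "Sheraton", "Westin")

# Each element is paired with its successor; the last one has none and gets NA
targets <- lead(stays)
targets
# "Sheraton" "Westin" NA
```

The trailing NA is why the pipeline ends with filter(!is.na(target)): the last column of each row has no following node and so defines no link.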
Now the data is already in the typical network data format of one link per row defined by "source" and "target" columns, and it could be used with sankeyNetwork(). However, you will likely want nodes referring to the same thing appearing multiple times within your plot... if someone visited the Hilton in year 1, and then visited the Hilton again in year 3, you will probably want 2 separate nodes, both named Hilton, but appearing in different parts of the plot. In order to do that, you will have to identify each node in your "source" and "target" columns with the year in which they were visited. That's where keeping the "row" and "column" variables around will come in handy.
Append the column index to the "source" name, and append the column index + 1 to the "target" name, and now you will be able to distinguish, for instance, between the node for Hilton which was visited in year 1 and the node for Hilton that was visited in year 3:
links <-
  links %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)
# # A tibble: 24 x 2
# source target
# <chr> <chr>
# 1 Bob_1 Hilton_2
# 2 John_1 Four Seasons_2
# 3 Tom_1 Ritz-Carlton_2
# 4 Mary_1 Westin_2
# 5 Sue_1 Hyatt_2
# 6 Barb_1 Hilton_2
# 7 Hilton_2 Sheraton_3
# 8 Four Seasons_2 Ritz-Carlton_3
# 9 Ritz-Carlton_2 Westin_3
# 10 Westin_2 Sheraton_3
# # ... with 14 more rows
Now you can follow the rather standard procedure for using a source-target list of links to build the necessary data frames for sankeyNetwork(). Create a nodes data frame with all the unique nodes found in the "source" and "target" vectors. Convert the "source" and "target" vectors in the links data frame to be the 0-based index of the node in the nodes data frame. Add an arbitrary value for each link in the links data frame, since it's required by sankeyNetwork(). Now you can remove the appended column index from the node names in the nodes data frame, because they will only be used to label the nodes in the resulting plot (so it no longer matters if they are unique). Then plot it with sankeyNetwork()!
nodes <- data.frame(name = unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1
nodes$name <- sub('_[0-9]+$', '', nodes$name)
library(networkD3)
library(htmlwidgets)
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
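With value set to 1 on every row, a transition made by several customers appears as several rows in links. If you would rather have one row per distinct transition, with its value equal to the number of customers, one possible variation (not part of the answer above) is to aggregate with dplyr's count() while the links are still in name form. A sketch on made-up rows:

```r
library(dplyr)

# Hypothetical links in name form (before index conversion);
# the first transition occurs twice
links <- data.frame(source = c("Hilton_2", "Hilton_2", "Westin_2"),
                    target = c("Sheraton_3", "Sheraton_3", "Sheraton_3"),
                    stringsAsFactors = FALSE)

# One row per distinct source/target pair; `value` holds how often it occurred
links <- count(links, source, target, name = "value")
```

The resulting value column can then be passed to sankeyNetwork() through Value = 'value' in place of the constant 1.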
Source: https://stackoverflow.com/questions/44132423/creating-a-sankey-diagram-using-networkd3-package-in-r