Gather connected IDs across different rows of data frame

Submitted by ℡╲_俬逩灬. on 2019-12-12 00:29:14

Question


Given an R data frame like this:

DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"), 
                   ID2 = c("D",NA,"G",NA,NA,NA,"H",NA), 
                   ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))

> DF.a
  ID1  ID2  ID3
1   A    D    F
2   B <NA> <NA>
3   C    G <NA>
4   D <NA> <NA>
5   E <NA> <NA>
6   F <NA> <NA>
7   G    H <NA>
8   H <NA> <NA>

I would like to simplify/reshape it into the following:

DF.b <- data.frame(ID1 = c("A","B","C","E"),
                   ID2 = c("D",NA,"G",NA),
                   ID3 = c("F",NA,"H",NA))

> DF.b
  ID1  ID2  ID3
1   A    D    F
2   B <NA> <NA>
3   C    G    H
4   E <NA> <NA>

It does not seem like a straightforward reshape. The goal is to get all "connected" ID values together on a single row. Note how the connection between "C" and "H" is indirect, as both are connected to "G", but they don't appear together on the same row of DF.a. The order of the ID values in rows of DF.b does not matter.


Answer 1:


You can treat this as the problem of finding all the connected components of a graph. The first step I would take is to convert your data into a more natural structure: a vector of nodes and a matrix of edges, where each edge links two IDs that sit next to each other on the same row:

(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
   x <- x[!is.na(x)]
   cbind(head(x, -1), tail(x, -1))
})))
#     [,1] [,2]
# ID1 "A"  "D" 
# ID2 "D"  "F" 
# ID1 "C"  "G" 
# ID1 "G"  "H"
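
One caveat with the construction above: apply() simplifies its result to a matrix when every row yields a result of the same length, and do.call(rbind, ...) expects a list. The following is a minimal sketch of a variant that always produces a list by looping with lapply(); the names m, edge_list and edges2 are illustrative and not part of the original answer, and stringsAsFactors = FALSE is added so the sketch behaves the same on older R versions.

```r
# Same data as DF.a in the question
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
                   ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
                   ID3 = c("F",NA,NA,NA,NA,NA,NA,NA),
                   stringsAsFactors = FALSE)

m <- as.matrix(DF.a)  # character matrix, one row per record

# lapply() always returns a list, so do.call(rbind, ...) stays valid even
# when every row happens to hold the same number of IDs
edge_list <- lapply(seq_len(nrow(m)), function(i) {
  x <- m[i, ]
  x <- x[!is.na(x)]                 # drop missing IDs
  cbind(head(x, -1), tail(x, -1))   # pair each ID with its successor
})

# Stack the per-row edge matrices and drop the leftover row names
edges2 <- unname(do.call(rbind, edge_list))
edges2
```

Rows with a single ID contribute a matrix with zero rows, so they add no edges; those IDs still end up in the graph because they are listed in nodes.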

Now you are ready to build a graph and compute its components:

library(igraph)
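# graph_from_data_frame() is the current name for the deprecated graph.data.frame()
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)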
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
# 
# $`2`
# [1] "B"
# 
# $`3`
# [1] "C" "G" "H"
# 
# $`4`
# [1] "E"

The output of the split function is a list, where each list element is all the nodes in one of the components of the graph. Personally I think this is the most useful representation of the output data, but if you really wanted the NA-padded structure you describe you could try something like:

max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
#   [,1] [,2] [,3]
# 1 "A"  "D"  "F" 
# 2 "B"  NA   NA  
# 3 "C"  "G"  "H" 
# 4 "E"  NA   NA  
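
If the result should be a data frame shaped like DF.b rather than a character matrix, one possible follow-up is sketched below. It hard-codes comp as the list shown earlier so that it runs without igraph; padded and DF.out are illustrative names, not part of the original answer.

```r
# The components found above, written out by hand for a standalone example
comp <- list(`1` = c("A", "D", "F"), `2` = "B",
             `3` = c("C", "G", "H"), `4` = "E")

# Pad every component with NA up to the length of the longest one
max.len <- max(lengths(comp))
padded <- do.call(rbind, lapply(comp, function(x) { length(x) <- max.len; x }))

# Convert to a data frame and name the columns ID1, ID2, ...
DF.out <- as.data.frame(padded, stringsAsFactors = FALSE)
names(DF.out) <- paste0("ID", seq_len(ncol(DF.out)))
rownames(DF.out) <- NULL
DF.out
```

The column count follows the largest component, so a data set with a longer chain of connected IDs would simply gain more ID columns.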

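For cases where igraph is not available, the same grouping can be approximated in base R. This is only a rough sketch using simple label propagation, not the method igraph uses internally, and it is intended for small inputs; nodes and edges are written out by hand here, and label and comp_base are illustrative names.

```r
nodes <- c("A", "B", "C", "D", "E", "F", "G", "H")
edges <- rbind(c("A", "D"), c("D", "F"), c("C", "G"), c("G", "H"))

# Start with every node in its own component, labelled by its position
label <- setNames(seq_along(nodes), nodes)

# Repeatedly give both ends of each edge the smaller of their two labels
# until a full pass over the edges changes nothing
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(edges))) {
    a <- edges[i, 1]
    b <- edges[i, 2]
    lo <- min(label[a], label[b])
    if (label[a] != lo || label[b] != lo) {
      label[c(a, b)] <- lo
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Nodes that share a label belong to the same component
comp_base <- unname(split(nodes, label))
comp_base
```

Each pass can only lower labels, so the loop terminates; for large graphs igraph's components() is the better choice.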

Source: https://stackoverflow.com/questions/32377965/gather-connected-ids-across-different-rows-of-data-frame
