问题
Given an R data frame like this:
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))
> DF.a
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G <NA>
4 D <NA> <NA>
5 E <NA> <NA>
6 F <NA> <NA>
7 G H <NA>
8 H <NA> <NA>
I would like to simplify/reshape it into the following:
DF.b <- data.frame(ID1 = c("A","B","C","E"),
ID2 = c("D",NA,"G",NA),
ID3 = c("F",NA,"H",NA))
> DF.b
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G H
4 E <NA> <NA>
It does not seem like a straightforward reshape. The goal is to get all "connected" ID values together on a single row. Note how the connection between "C" and "H" is indirect, as both are connected to "G", but they don't appear together on the same row of DF.a. The order of the ID values in rows of DF.b does not matter.
回答1:
Really you could think of this as trying to get all the connected components of a graph. The first step I would take would be to convert your data into a more natural structure -- a vector of nodes and matrix of edges:
(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
x <- x[!is.na(x)]
cbind(head(x, -1), tail(x, -1))
})))
# [,1] [,2]
# ID1 "A" "D"
# ID2 "D" "F"
# ID1 "C" "G"
# ID1 "G" "H"
Now you are ready to build a graph and compute its components:
library(igraph)
g <- graph.data.frame(edges, FALSE, nodes)
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
#
# $`2`
# [1] "B"
#
# $`3`
# [1] "C" "G" "H"
#
# $`4`
# [1] "E"
The output of the split
function is a list, where each list element is all the nodes in one of the components of the graph. Personally I think this is the most useful representation of the output data, but if you really wanted the NA-padded structure you describe you could try something like:
max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
# [,1] [,2] [,3]
# 1 "A" "D" "F"
# 2 "B" NA NA
# 3 "C" "G" "H"
# 4 "E" NA NA
来源:https://stackoverflow.com/questions/32377965/gather-connected-ids-across-different-rows-of-data-frame