Question
I have a large data.frame (20000+ entries) in this format:
id D1 D2
1 0.40 0.21
1 0.00 0.00
1 0.53 0.20
2 0.17 0.17
2 0.25 0.25
2 0.55 0.43
Each id may be duplicated 3-20 times. I would like to merge the duplicated rows into new columns, so my new data.frame looks like:
id D1 D2 D3 D4 D5 D6
1 0.40 0.21 0.00 0.00 0.53 0.20
2 0.17 0.17 0.25 0.25 0.55 0.43
I've manipulated data.frames before with plyr, but I'm not sure how to approach this problem. Any help would be appreciated. Thanks.
Answer 1:
The best option would be to just use melt and dcast from "reshape2". But before we jump to that option, let's see what else we have available to us:
You mention that the number of rows per "id" is unbalanced. That would make it somewhat difficult to put into a tidy rectangular data.frame.
Here are a few examples.
Balanced data: Three rows per "id"
mydf <- structure(list(id = c(1, 1, 1, 2, 2, 2),
                       D1 = c(0.4, 0, 0.53, 0.17, 0.25, 0.55),
                       D2 = c(0.21, 0, 0.2, 0.17, 0.25, 0.43)),
                  .Names = c("id", "D1", "D2"), row.names = c(NA, 6L),
                  class = "data.frame")
mydf
# id D1 D2
# 1 1 0.40 0.21
# 2 1 0.00 0.00
# 3 1 0.53 0.20
# 4 2 0.17 0.17
# 5 2 0.25 0.25
# 6 2 0.55 0.43
With such data, you can just use aggregate:
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# id D1.1 D1.2 D1.3 D2.1 D2.2 D2.3
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20
# 2 2 0.17 0.25 0.55 0.17 0.25 0.43
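Note that aggregate lays the values out column-wise (all of "D1", then all of "D2"), while the desired output in the question interleaves D1 and D2 row by row. If that exact order matters, here is a minimal base R sketch (assuming balanced data; this is an addition, not part of the original answer) that flattens each id's rows with t():
sp <- split(mydf[-1], mydf$id)
out <- do.call(rbind, lapply(sp, function(x) c(t(as.matrix(x)))))
out <- data.frame(id = as.numeric(names(sp)), out)
names(out)[-1] <- paste0("D", seq_len(ncol(out) - 1))
out
#   id   D1   D2   D3   D4   D5   D6
# 1  1 0.40 0.21 0.00 0.00 0.53 0.20
# 2  2 0.17 0.17 0.25 0.25 0.55 0.43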
Unbalanced data: Some workarounds
If you add a fourth value for "id = 2", aggregate won't work anymore:
mydf[7, ] <- c(2, .44, .33)
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# Error in data.frame(`0` = c(0.4, 0, 0.53), `1` = c(0.17, 0.25, 0.55, 0.44 :
# arguments imply differing number of rows: 3, 4
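The problem is visible from the group sizes: after adding the extra row, "id = 1" has three rows but "id = 2" has four, and aggregate cannot bind vectors of different lengths into one data.frame:
table(mydf$id)
#
# 1 2
# 3 4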
It might be best to just have a list of the resulting vectors:
lapply(split(mydf[-1], mydf[[1]]), function(x) unlist(x, use.names=FALSE))
# $`1`
# [1] 0.40 0.00 0.53 0.21 0.00 0.20
#
# $`2`
# [1] 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
#
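If you keep that list around (stored here as res, a name assumed for illustration), lengths() gives a quick check of how unbalanced the groups are:
res <- lapply(split(mydf[-1], mydf[[1]]), function(x) unlist(x, use.names = FALSE))
lengths(res)
# 1 2
# 6 8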
Or, if you insist on a rectangular data.frame, explore one of the several tools to rbind unbalanced data, for example, rbind.fill from "plyr":
library(plyr)
rbind.fill(lapply(split(mydf[-1], mydf[[1]]),
           function(x) data.frame(t(unlist(x, use.names=FALSE)))))
# X1 X2 X3 X4 X5 X6 X7 X8
# 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
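One caveat: rbind.fill drops the list names, so the "id" column is lost in that result. A small follow-up sketch (an addition, not from the original answer) puts it back using the names of the split:
sp <- split(mydf[-1], mydf[[1]])
wide <- rbind.fill(lapply(sp, function(x) data.frame(t(unlist(x, use.names = FALSE)))))
cbind(id = as.numeric(names(sp)), wide)
#   id   X1   X2   X3   X4   X5   X6   X7   X8
# 1  1 0.40 0.00 0.53 0.21 0.00 0.20   NA   NA
# 2  2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33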
Unbalanced data: A more direct approach
Alternatively, you can use melt and dcast from "reshape2" as follows:
library(reshape2)
x <- melt(mydf, id.vars = "id")
## ^^ That's not enough information for `dcast`
## We need a "time" variable too, so use `ave`
## to create one according to the number of
## values per ID.
x$time <- ave(x$id, x$id, FUN = seq_along)
## ^^ I would probably actually stop at this point.
## Long data with proper ID and "time" values
## tend to be easier to work with and many
## other functions in R work more nicely with
## this long data format.
dcast(x, id ~ time, value.var = "value")
# id 1 2 3 4 5 6 7 8
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
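For reference, the same reshape can also be written with the newer "tidyr" and "dplyr" packages (a sketch, not part of the original answer). Note that pivot_longer stacks the values row by row, so the columns come out in the interleaved order shown in the question rather than the column-wise order dcast produced above:
library(dplyr)
library(tidyr)
mydf %>%
  pivot_longer(-id, values_to = "value") %>%   ## long format, similar to melt()
  group_by(id) %>%
  mutate(time = row_number()) %>%              ## "time" index within each id
  ungroup() %>%
  select(-name) %>%
  pivot_wider(names_from = time, values_from = value)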
Source: https://stackoverflow.com/questions/17894228/transpose-duplicated-rows-to-column-in-r