问题
I have a table like this,
> head(dt2)
Weight Height Fitted interval limit value
1 65.6 174.0 71.91200 pred lwr 53.73165
2 80.7 193.5 91.63237 pred lwr 73.33198
3 72.6 186.5 84.55326 pred lwr 66.31751
4 78.8 187.2 85.26117 pred lwr 67.02004
5 74.8 181.5 79.49675 pred lwr 61.29244
6 86.4 184.0 82.02501 pred lwr 63.80652
I want it to have like this,
> head(reshape2::dcast(dt2,
Weight + Height + Fitted + interval ~ limit,
fun.aggregate = mean))
Weight Height Fitted interval lwr upr
1 42.0 153.4 51.07920 conf 49.15463 53.00376
2 42.0 153.4 51.07920 pred 32.82122 69.33717
3 43.2 160.0 57.75378 conf 56.35240 59.15516
4 43.2 160.0 57.75378 pred 39.54352 75.96404
5 44.8 149.5 47.13512 conf 44.87642 49.39382
6 44.8 149.5 47.13512 pred 28.83891 65.43133
But using tidyr::spread
, How can I do that?
I was using,
> tidyr::spread(dt2, limit, value)
But getting the error,
Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)
Random 10 Rows::
> dt[sample(nrow(dt), 10), ]
Weight Height Fitted interval limit value
1253 52.2 162.5 60.28203 conf upr 61.51087
426 49.1 158.8 56.54022 pred upr 74.75756
1117 78.4 184.5 82.53066 conf lwr 80.98778
1171 85.9 166.4 64.22611 conf lwr 63.21254
948 61.4 177.8 75.75494 conf lwr 74.66393
384 90.9 172.7 70.59731 pred lwr 52.41828
289 75.9 172.7 70.59731 pred lwr 52.41828
3 44.8 149.5 47.13512 pred lwr 28.83891
774 87.3 182.9 80.91258 pred upr 99.12445
772 86.4 175.3 73.22669 pred upr 91.40919
回答1:
Let's say you were starting with data that looked like this:
mydf
# Weight Height Fitted interval limit value
# 1 42 153.4 51.0792 conf lwr 49.15463
# 2 42 153.4 51.0792 pred lwr 32.82122
# 3 42 153.4 51.0792 conf upr 53.00376
# 4 42 153.4 51.0792 pred upr 69.33717
# 5 42 153.4 51.0792 conf lwr 60.00000
# 6 42 153.4 51.0792 pred lwr 90.00000
Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.
tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)
As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise
for the "value" column.
If you stopped the execution at the summarise
step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread
would work as expected.
mydf %>%
group_by(Weight, Height, Fitted, interval, limit) %>%
summarise(value = mean(value)) %>%
spread(limit, value)
# Source: local data frame [2 x 6]
#
# Weight Height Fitted interval lwr upr
# (dbl) (dbl) (dbl) (chr) (dbl) (dbl)
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
This matches the expected output from dcast
with fun.aggregate = mean
.
reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
# Weight Height Fitted interval lwr upr
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
Sample data:
mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4,
153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,
51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",
"conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,
2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),
value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,
90)), .Names = c("Weight", "Height", "Fitted", "interval",
"limit", "value"), row.names = c(NA, 6L), class = "data.frame")
回答2:
Here are data.table
alternatives to dplyr
. Use mydf
from Ananda's answer.
library(data.table)
library(magrittr)
library(tidyr)
DT <- data.table(mydf)
First, you can use by
to compute the mean by each limit.
DT[, .(lwr = mean(value[limit == "lwr"]),
upr = mean(value[limit == "upr"])),
by = .(Weight, Height, Fitted, interval)]
If this limit == ...
looks too much hard coding, you can first aggregate into a long format, then spread
. This works because once you aggregate, there is no duplicate.
DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
spread(key = "limit", value = "value")
Both gets you
# Weight Height Fitted interval lwr upr
#1: 42 153.4 51.0792 conf 54.57731 53.00376
#2: 42 153.4 51.0792 pred 61.41061 69.33717
来源:https://stackoverflow.com/questions/35225052/spread-vs-dcast