问题
Ok, second R question in quick succession.
My data:
Timestamp St_01 St_02 ...
1 2008-02-08 00:00:00 26.020 25.840 ...
2 2008-02-08 00:10:00 25.985 25.790 ...
3 2008-02-08 00:20:00 25.930 25.765 ...
4 2008-02-08 00:30:00 25.925 25.730 ...
5 2008-02-08 00:40:00 25.975 25.695 ...
...
Basically normally I would use a combination of ddply
and summarize
to calculate ensembles (e.g. mean for every hour across the whole year).
In the case above, I would create a category, e.g. hour (e.g. strptime(data$Timestamp,"%H") -> data$hour
and then use that category in ddply
, like ddply(data,"hour", summarize, St_01=mean(St_01), St_02=mean(St_02)...)
to average by category across each of the columns.
but here is where it gets sticky. I have more than 40 columns to deal with and I'm not prepared to type them all one by one as parameters to the summarize
function. I used to write a loop in shell to generate this code but that's not how programmers solve problems is it?
So pray tell, does anyone have a better way of achieving the same result but with less keystrokes?
回答1:
You can use numcolwise()
to run a summary over all numeric columns.
Here is an example using iris
:
ddply(iris, .(Species), numcolwise(mean))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Similarly, there is catcolwise()
to summarise over all categorical columns.
See ?numcolwise
for more help and examples.
EDIT
An alternative approach is to use reshape2
(proposed by @gsk3). This has more keystrokes in this example, but gives you enormous flexibility:
library(reshape2)
miris <- melt(iris, id.vars="Species")
x <- ddply(miris, .(Species, variable), summarize, mean=mean(value))
dcast(x, Species~variable, value.var="mean")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
回答2:
You can even simplify the second approach proposed by Andrie by omitting the ddply call completely. Just specify mean
as the aggregation function in the dcast call:
library(reshape2)
miris <- melt(iris, id.vars="Species")
dcast(miris, Species ~ variable, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
The same result can also be calculated very fast using the data.table
package. The .SD
variable in the j expression is a special data.table variable containing the subset of data for each group, excluding all columns used in by
.
library(data.table)
dt_iris <- as.data.table(iris)
dt_iris[, lapply(.SD, mean), by = Species]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 5.006 3.428 1.462 0.246
2: versicolor 5.936 2.770 4.260 1.326
3: virginica 6.588 2.974 5.552 2.026
Yet another option would be the new version 0.2 of Hadley's dplyr
package
library(dplyr)
group_by(iris, Species) %>% summarise_each(funs(mean))
Source: local data frame [3 x 5]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
来源:https://stackoverflow.com/questions/10787640/ddply-summarize-for-repeating-same-statistical-function-across-large-number-of