问题
Thanks all for your responses and answers. I can see I've unintentionally left out some important details that may help you understand my problem better. I was trying to keep it simple and generic, but that didn't actually help. Here's an updated version with more information.
I have a data.frame
with many columns that came from a NetLogo
model generated by BehaviorSpace
. Each column is a time series that represents a reported value under different experimental conditions with repetitions represented by the run number and time step number. For example (sorry this is long but I'm trying to give you a flavor for the data):
# Start by building a fake data.frame that models some of the characteristics of mine:
df <- data.frame(run = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5), rep(7,5), rep(8,5)))
df2 <- expand.grid(step = 1:5, fac.a = c(10,1000), fac.b = c(0.5,2.0))
df <- data.frame(run = df$run, rep = c(rep(1,20), rep(2,20)), step = df2$step, fac.a = df2$fac.a, fac.b = df2$fac.b)
log_growth <- function (a, b, x) {(1/(1+a*exp(-b*x))) + rnorm(1,0,0.2)}
set.seed(11)
df$treatment1 <- log_growth(df$fac.a, df$fac.b, df$step)
df$treatment2 <- log_growth(df$fac.a / 2, df$fac.b * 2, df$step)
This puts the following into df:
> df
run rep step fac.a fac.b treatment1 treatment2
1 1 1 1 10 0.5 0.05288201 0.356176584
2 1 1 2 10 0.5 0.12507561 0.600407158
3 1 1 3 10 0.5 0.22081815 0.804671117
4 1 1 4 10 0.5 0.33627099 0.920093934
5 1 1 5 10 0.5 0.46053940 0.971397427
6 2 1 1 1000 0.5 -0.08700866 0.009396323
7 2 1 2 1000 0.5 -0.08594375 0.018552055
8 2 1 3 1000 0.5 -0.08419297 0.042608835
9 2 1 4 1000 0.5 -0.08131981 0.102435481
10 2 1 5 1000 0.5 -0.07661880 0.232875872
11 3 1 1 10 2.0 0.33627099 0.920093934
12 3 1 2 10 2.0 0.75654214 1.002314651
13 3 1 3 10 2.0 0.88715737 1.003958435
14 3 1 4 10 2.0 0.90800192 1.003988593
15 3 1 5 10 2.0 0.91089154 1.003989145
16 4 1 1 1000 2.0 -0.08131981 0.102435481
17 4 1 2 1000 2.0 -0.03688314 0.860350536
18 4 1 3 1000 2.0 0.19880473 1.000926458
19 4 1 4 1000 2.0 0.66014952 1.003932891
20 4 1 5 1000 2.0 0.86791705 1.003988125
21 5 2 1 10 0.5 0.05288201 0.356176584
22 5 2 2 10 0.5 0.12507561 0.600407158
23 5 2 3 10 0.5 0.22081815 0.804671117
24 5 2 4 10 0.5 0.33627099 0.920093934
25 5 2 5 10 0.5 0.46053940 0.971397427
26 6 2 1 1000 0.5 -0.08700866 0.009396323
27 6 2 2 1000 0.5 -0.08594375 0.018552055
28 6 2 3 1000 0.5 -0.08419297 0.042608835
29 6 2 4 1000 0.5 -0.08131981 0.102435481
30 6 2 5 1000 0.5 -0.07661880 0.232875872
31 7 2 1 10 2.0 0.33627099 0.920093934
32 7 2 2 10 2.0 0.75654214 1.002314651
33 7 2 3 10 2.0 0.88715737 1.003958435
34 7 2 4 10 2.0 0.90800192 1.003988593
35 7 2 5 10 2.0 0.91089154 1.003989145
36 8 2 1 1000 2.0 -0.08131981 0.102435481
37 8 2 2 1000 2.0 -0.03688314 0.860350536
38 8 2 3 1000 2.0 0.19880473 1.000926458
39 8 2 4 1000 2.0 0.66014952 1.003932891
40 8 2 5 1000 2.0 0.86791705 1.003988125
So what I did before is split up the data frame using by
and wanted to obtain averages and standard deviations for every step (it's a time series) and each combination of factors.
After having looked at all your answers and having reconsidered my problem, I think what I'm trying to do would be better handled during the conversion process of by
. I'm not exactly sure how to do that... What I want the output to look like is a summary of sorts:
> df
run fac.a fac.b mean.treatment1 mean.treatment2 sd.treatment1 sd.treatment2
1 1 10 0.5 xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxx
1 1 10 2.0 xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxx
1 1 1000 0.5 xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxx
1 1 1000 2.0 xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxxx
Is this a job for aggregate
? Thanks for your patience and help. -- Glenn
Original question:
I have a data.frame
with many columns, each of which represents a specific experimental condition with repetitions.
> df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5), mean.a = 0, mean.b = 0, mean.1 = 0, mean.2 = 0)
> df
a.1 b.1 a.2 b.2 mean.a mean.b sd.a sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827 0 0 0 0
2 0.4171254 0.4883140 0.8282384 0.1215129 0 0 0 0
3 0.2291582 0.9419946 0.4089008 0.5665242 0 0 0 0
4 0.3807868 0.1889066 0.8271075 0.4022014 0 0 0 0
5 0.5863078 0.4991847 0.4082745 0.5637367 0 0 0 0
I want to find means and standard deviations for each condition and repetition. So far the most direct way seems to be:
for (i in c("a.1", "a.2") {df$mean.a <- df$mean.a + df[[i]]}
df$mean.a <- df$mean.a / 2
But I have a lot of columns, and they are all over the place, so this seems really labor intensive and manual. A little nicer method is to use ave()
:
df$mean.a <- with (df, ave(a.1, a.2))
But if I want to do sd() instead, I mysteriously get NAs:
df$sd.a <- with (df, ave(a.1, a.2, FUN = sd))
> df
a.1 b.1 a.2 b.2 mean.a mean.b sd.a sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827 0.9209433 0 NA 0
2 0.4171254 0.4883140 0.8282384 0.1215129 0.4171254 0 NA 0
3 0.2291582 0.9419946 0.4089008 0.5665242 0.2291582 0 NA 0
4 0.3807868 0.1889066 0.8271075 0.4022014 0.3807868 0 NA 0
5 0.5863078 0.4991847 0.4082745 0.5637367 0.5863078 0 NA 0
I would prefer not to use external packages if possible, but it seems like I'm missing something basic. This question was similar, but had to do with data.tables, not data.frames.
Another was even closer, but using ave() is also tedious to specify, for instance, columns 1-12, 15-17, and 26 as the subject columns, and mysteriously, sd()
produces those NA's. Seems like there should be a straightforward way to do this. Almost makes me wish for Excel. :-)
回答1:
Let us first bring your data into an acceptable format. Note that this solution does, against your initial requirements, indeed rely on external libraries, but they are very common and true timesavers today! (plyr and reshape2 by Hadley Wickham, who is a phenomenon in the R community)
# Note how I only used the data columns, initially, there is no mean and sd column in the data frame used at this stage.
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
df$repetition = c(1:nrow(df))
library(reshape2)
tmp = melt(df, id.vars = "repetition")
names(tmp)[2] = "condition"
tmp$treatment = substring(tmp$condition,1,1)
This yields:
> head(tmp)
repetition condition value treatment
1 1 a.1 0.6668952 a
2 2 a.1 0.1248151 a
3 3 a.1 0.7082199 a
4 4 a.1 0.9840956 a
5 5 a.1 0.4479190 a
6 1 b.1 0.9381539 b
Now, the rest is easy, we rely on the popular plyr package:
library(plyr)
results = ddply(tmp, .(repetition, treatment), summarize, mean = mean(value), sd = sd(value) )
The final result is
> head(results)
repetition treatment mean sd
1 1 a 0.6777342 0.01532853
2 1 b 0.6734955 0.37428353
3 2 a 0.4533126 0.46456561
4 2 b 0.8441925 0.07260509
5 3 a 0.3967338 0.44050779
6 3 b 0.5886821 0.42635902
Let's hope this is what you were looking for.
One more interesting addition, if you do not want to differentiate each repetition, but rather on a treatment level
# addition
results = ddply(tmp, .( treatment), summarize, mean = mean(value), sd = sd(value) )
and the result:
> head(results)
treatment mean sd
1 a 0.5817867 0.2954151
2 b 0.6212537 0.3219035
回答2:
Ignoring the "base-only" requirement to whip the data into shape, using tidyr
and the pipe operator from magrittr
:
set.seed(42)
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
df2 <- df %>%
gather(treatment, value) %>%
separate(treatment, c("treatment", "repetition"))
head(df2)
# treatment repetition value
# 1 a 1 0.13871017
# 2 a 1 0.98889173
# 3 a 1 0.94666823
# 4 a 1 0.08243756
# 5 a 1 0.51421178
# 6 b 1 0.39020347
Now, I'm not sure what exactly you're trying to get the average and standard deviation of, but one easy option is aggregate()
from base R. Simple pass the function you'd like through the FUN
parameter:
# calculate mean on treatment (a or b)
aggregate(df2$value, by = list(treatment = df2$treatment), FUN = mean)
# treatment repetition x
# 1 a 1 0.5341839
# 2 b 1 0.6633022
# 3 a 2 0.5442395
# 4 b 2 0.4225865
# calculate mean on treatment and repetition
aggregate(df2$value, by = list(treatment = df2$treatment, repetition = df2$repetition, FUN = mean)
# treatment x
# 1 a 0.5392117
# 2 b 0.5429444
回答3:
Based on the code you showed, may be this base R
method would help:
set.seed(42)
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
do.call(cbind,
lapply(split(seq_along(df),gsub("\\..*", "",colnames(df))), function(x) {
x1 <- df[,x]
data.frame(Means=rowMeans(x1, na.rm=TRUE), SD=apply(x1, 1, sd, na.rm=TRUE))}))
# a.Means a.SD b.Means b.SD
#1 0.6862739 0.3231932 0.7295552 0.29763438
#2 0.8280938 0.1541232 0.8574074 0.17086395
#3 0.6104059 0.4585819 0.1260770 0.01214755
#4 0.5429382 0.4065997 0.5659947 0.12869005
#5 0.5520192 0.1268922 0.6326988 0.10234101
Using your code, I get the same result
vec1 <- vector("numeric", length=5)
for(i in c("a.1", "a.2")) {vec1 <- vec1+df[[i]]}
vec1/2
#[1] 0.6862739 0.8280938 0.6104059 0.5429382 0.5520192
来源:https://stackoverflow.com/questions/25900795/how-do-i-create-a-column-of-means-of-specific-columns-in-a-data-frame