Question
I have a data frame called thetas containing about 2.7 million observations.
> str(thetas)
'data.frame': 2700000 obs. of 8 variables:
$ rho_cnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ pct_cnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ sx : num 1 2 3 4 5 6 7 8 9 10 ...
$ model : Factor w/ 7 levels "dN.mN","dN.mL",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estTheta : num -1.58 -1.716 0.504 -2.296 0.98 ...
$ trueTheta : num 0.0962 -3.3913 3.6006 -0.1971 2.1906 ...
$ estError : num -1.68 1.68 -3.1 -2.1 -1.21 ...
$ trueAberSx: num 0 0 0 0 0 0 0 0 0 0 ...
I would like to use ddply, or some similar function, to sum the error of estimation (the column estError in my data frame), but where the sums are within each condition of my simulation. The problem is, I don't have a simple way to combine values from the other columns of this data frame to uniquely identify all those conditions. To be more specific: the column model contains 7 possible values. Three of these possible values are only matched up with one possible value in each of rho_cnd and pct_cnd, while the other four possible values of model are matched up with 6 possible pairings of values in rho_cnd and pct_cnd.
The obvious solution, I know, would be to go back and make a variable that uniquely identifies all the conditions that I would need to identify here, so that the following code would work:
> sums <- ddply(thetas, .(condition1, condition2, etc.), summarise, sumError = sum(estError))
But I just don't want to go back and recreate how this data frame is built. Right now I have two data frames created with two separate calls to expand.grid that are then rbind-ed and sorted to create a data frame listing all valid conditions, but even if I kept those few lines of code in I'm not sure how to reference them with ddply. I would rather not even use this solution, but I will if necessary.
> conditions
models rhos pcts
1 dN.mN 0.0 0.00
2 dN.mL 0.0 0.00
3 dN.mH 0.0 0.00
4 dL.mN 0.1 0.01
12 dL.mN 0.1 0.02
20 dL.mN 0.1 0.10
8 dL.mN 0.2 0.01
16 dL.mN 0.2 0.02
24 dL.mN 0.2 0.10
5 dL.mL 0.1 0.01
13 dL.mL 0.1 0.02
21 dL.mL 0.1 0.10
9 dL.mL 0.2 0.01
17 dL.mL 0.2 0.02
25 dL.mL 0.2 0.10
6 dH.mN 0.1 0.01
14 dH.mN 0.1 0.02
22 dH.mN 0.1 0.10
10 dH.mN 0.2 0.01
18 dH.mN 0.2 0.02
26 dH.mN 0.2 0.10
7 dH.mH 0.1 0.01
15 dH.mH 0.1 0.02
23 dH.mH 0.1 0.10
11 dH.mH 0.2 0.01
19 dH.mH 0.2 0.02
27 dH.mH 0.2 0.10
Any advice for better code and/or more efficiency? Thanks!
Answer 1:
I agree with the comment that ddply(thetas,.(model,rho_cnd,pct_cnd),...) should work. If certain combinations of those variables never occur in the data, .drop=TRUE (the default for ddply) ensures that the unobserved combinations don't show up in the result.
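As a minimal sketch of that call, assuming the plyr package is installed: the data frame toy and the output column sumError below are made up for illustration and are not part of the original data. The full grid of these toy values would have 12 combinations, but only the 3 that occur produce a row.

```r
library(plyr)

# Toy stand-in for thetas (hypothetical values): dN.mN only ever occurs
# with rho_cnd = 0 and pct_cnd = 0, like the baseline models in the question
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(-1, 2, 0.5, 0.5, 3, -1)
)

# .drop = TRUE is the default; it is spelled out here only for clarity
sums <- ddply(toy, .(model, rho_cnd, pct_cnd), summarise,
              sumError = sum(estError), .drop = TRUE)
print(sums)
```

The same pattern should carry over to thetas by swapping toy for the real data frame.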
However, if you wanted to avoid ddply looking through some of the nonexistent combinations, you could try something like the following:
#newCond <- apply(thetas[,c("model", "rho_cnd", "pct_cnd")], 1, paste, collapse="_")
# as suggested by baptiste; sep has to go inside the argument list handed to do.call
newCond <- do.call(paste, c(thetas[,c("model", "rho_cnd", "pct_cnd")], sep="_"))
thetas2 <- cbind(thetas, newCond)
I admit, the above code might run slowly for you, so I'm not sure it's what you want. But from there you should be able to call ddply() on thetas2 with .variables="newCond".
Furthermore, because you're returning only a single number for each subset of the data, you could just use aggregate, if you wanted.
# by must be a list, so index with single brackets to keep a one-column data frame
sums <- aggregate(thetas2["estError"], by=thetas2["newCond"], FUN=sum)
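A small end-to-end sketch of the key-plus-aggregate idea in base R, on toy data. The data frame toy and the names byKey and byCols are hypothetical. It also shows that the formula interface of aggregate can group on the three columns directly, which may make the pasted key unnecessary.

```r
# Toy stand-in for thetas (values are made up)
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(-1, 2, 0.5, 0.5, 3, -1)
)

# Build one key per row; c() appends sep to the list of columns given to paste()
toy$newCond <- do.call(paste, c(toy[, c("model", "rho_cnd", "pct_cnd")], sep = "_"))

# Sum estError within each key
byKey <- aggregate(estError ~ newCond, data = toy, FUN = sum)

# Same sums without the key: group on all three columns at once
byCols <- aggregate(estError ~ model + rho_cnd + pct_cnd, data = toy, FUN = sum)

print(byKey)
print(byCols)
```

Both results contain one row per combination that occurs in the data, so the choice between them is mostly a matter of whether a single label column is convenient downstream.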
I hope this helps.
Source: https://stackoverflow.com/questions/16363834/must-ddply-use-all-possible-combinations-of-the-splitting-variables-or-only-o