Question
I want to calculate the mean of each numeric variable in the following example. The means need to be grouped first by each level of "id" and then by each level of "status".
set.seed(10)
dfex <- data.frame(id = c("2", "1", "1", "1", "3", "2", "3"),
                   status = c("hit", "miss", "miss", "hit", "miss", "miss", "miss"),
                   var3 = rnorm(7), var4 = rnorm(7), var5 = rnorm(7), var6 = rnorm(7))
For the means of the "id" groups, the first row of output would be labeled "mean-id-1", followed by rows labeled "mean-id-2" and "mean-id-3". For the means of the "status" groups, the rows would be labeled "mean-status-miss" and "mean-status-hit". My objective is to generate these means and their row labels programmatically.
I've tried many different permutations of apply functions, but each has issues. I've also experimented with the aggregate function.
Answer 1:
With base R the following works for the "id" column:
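# average every "var" column within each level of id
means_id <- aggregate(dfex[, grep("var", names(dfex))], list(dfex$id), mean)
# use the grouping values as row labels, then drop the grouping column
rownames(means_id) <- paste0("mean-id-", means_id$Group.1)
means_id$Group.1 <- NULL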
Output:
                var3       var4       var5       var6
mean-id-1 -0.7182503 -0.2604572 -0.3535823 -1.3530417
mean-id-2  0.2042702 -0.3009548  0.6121843 -1.4364211
mean-id-3 -0.4567655  0.8716131  0.1646053 -0.6229102
The same approach works for the "status" column:
means_status <- aggregate(dfex[, grep("var", names(dfex))], list(dfex$status), mean)
rownames(means_status) <- paste0("mean-status-", means_status$Group.1)
means_status$Group.1 <- NULL
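Since the question asks for the labels to be generated programmatically, the two aggregate() calls above can be folded into one helper. This is only a minimal sketch: group_means() and all_means are hypothetical names introduced here, and it assumes the numeric columns are exactly those whose names start with "var".

```r
set.seed(10)
dfex <- data.frame(id = c("2", "1", "1", "1", "3", "2", "3"),
                   status = c("hit", "miss", "miss", "hit", "miss", "miss", "miss"),
                   var3 = rnorm(7), var4 = rnorm(7), var5 = rnorm(7), var6 = rnorm(7))

# group_means() is a hypothetical helper, not part of base R.
# It averages value_cols within each level of group_col and
# labels the rows "mean-<group_col>-<level>".
group_means <- function(df, group_col, value_cols) {
  res <- aggregate(df[value_cols], by = list(group = df[[group_col]]), FUN = mean)
  rownames(res) <- paste0("mean-", group_col, "-", res$group)
  res$group <- NULL
  res
}

# assumption: the numeric variables are the columns named var*
value_cols <- grep("^var", names(dfex), value = TRUE)

# one data frame holding the "id" rows followed by the "status" rows
all_means <- do.call(rbind, lapply(c("id", "status"),
                                   function(g) group_means(dfex, g, value_cols)))
print(all_means)
```

Because aggregate() sorts the groups, the "id" rows come out in the order 1, 2, 3, and "hit" precedes "miss" in the "status" rows.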
Answer 2:
For big data sets, data.table will probably be the fastest way to do this. I didn't find a way to set custom row names on a data.table object, so I converted each result back to a data.frame:
library(data.table)
setDT(dfex) # convert `dfex` to a `data.table` object by reference
# setkey(dfex, id) # optional: only needed if you want the table sorted by "id" first
# drop "status" (column 2), then average the remaining columns within each id
dat1 <- as.data.frame(dfex[, -2, with = FALSE][, lapply(.SD, mean), by = id])
rownames(dat1) <- paste0("mean-id-", as.character(dat1[, "id"]))
# drop "id" (column 1), then average the remaining columns within each status
dat2 <- as.data.frame(dfex[, -1, with = FALSE][, lapply(.SD, mean), by = status])
rownames(dat2) <- paste0("mean-status-", as.character(dat2[, "status"]))
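An alternative that stays inside data.table is to keep the label as an ordinary column instead of a row name. The following is a rough sketch of that idea, not the approach used above: dt, var_cols and the label column are names made up for illustration, and it assumes the value columns are those starting with "var". It uses as.data.table() so the original dfex is left untouched.

```r
library(data.table)

set.seed(10)
dfex <- data.frame(id = c("2", "1", "1", "1", "3", "2", "3"),
                   status = c("hit", "miss", "miss", "hit", "miss", "miss", "miss"),
                   var3 = rnorm(7), var4 = rnorm(7), var5 = rnorm(7), var6 = rnorm(7))

dt <- as.data.table(dfex)  # a copy; dfex itself stays a plain data.frame

# assumption: the numeric variables are the columns named var*
var_cols <- grep("^var", names(dt), value = TRUE)

# .SDcols restricts .SD to the value columns; keyby sorts the result by group
means_id <- dt[, lapply(.SD, mean), keyby = id, .SDcols = var_cols]
means_id[, label := paste0("mean-id-", id)]

means_status <- dt[, lapply(.SD, mean), keyby = status, .SDcols = var_cols]
means_status[, label := paste0("mean-status-", status)]

print(means_id)
print(means_status)
```

Selecting columns with .SDcols avoids relying on column positions such as -1 and -2, which would silently change meaning if the column order changed.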
Answer 3:
You could do:
do.call(rbind,by(dfex[,-(1:2)], paste("mean-id",dfex[,1],sep="-"), colMeans))
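Output (these values do not match the seeded output in Answer 1, presumably because they were computed from a different random draw of dfex):
                var3       var4       var5       var6
mean-id-1 -0.7383944  0.5005763 -0.4777325  0.6988741
mean-id-2 -0.0316267 -0.1764453  0.1313834  0.6867287
mean-id-3  0.7489377  0.8091953  0.9290247 -0.1263163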
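To create both results as a list, build the label prefix from the grouping column name so the "status" rows are labeled "mean-status-..." rather than "mean-id-...":
lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean",x,dfex[,x],sep="-"), colMeans)))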
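Update: the same pattern works for other column-wise summaries. For example, colSds from the matrixStats package gives the standard deviation of each variable per group. Note that the row labels in the output below still read "mean-id-..." because that prefix is hard-coded in the paste() call, even though the values are standard deviations and the second table is grouped by "status":
library(matrixStats)
lapply(c("id","status"), function(x) do.call(rbind,by(dfex[grep("var",names(dfex))], paste("mean-id",dfex[,x],sep="-"), colSds)))
[[1]]
               var3       var4      var5      var6
mean-id-1 0.6024318 1.36423044 0.5398717 0.7260939
mean-id-2 0.2623706 0.08870122 0.1827246 1.0590560
mean-id-3 1.0625137 0.16381062 1.0760977 0.3524908
[[2]]
                  var3     var4      var5      var6
mean-id-hit  0.4369311 1.036234 0.6622341 0.6506010
mean-id-miss 0.8288436 1.035163 0.7688912 0.6799636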
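If you would rather avoid the extra package and get labels that name the statistic, the same by() pattern can be parameterised in base R. This is just a sketch under stated assumptions: summarise_by() and its stat_name argument are hypothetical names introduced here, and the value columns are assumed to be those whose names start with "var".

```r
set.seed(10)
dfex <- data.frame(id = c("2", "1", "1", "1", "3", "2", "3"),
                   status = c("hit", "miss", "miss", "hit", "miss", "miss", "miss"),
                   var3 = rnorm(7), var4 = rnorm(7), var5 = rnorm(7), var6 = rnorm(7))

# summarise_by() is a hypothetical helper, not part of base R.
# stat is any function of a numeric vector (mean, sd, median, ...);
# stat_name is the prefix used in the row labels.
summarise_by <- function(df, group_col, stat, stat_name) {
  value_cols <- grep("^var", names(df))  # assumption: value columns are var*
  labels <- paste(stat_name, group_col, df[[group_col]], sep = "-")
  # apply stat to every value column within each label, then stack the rows
  do.call(rbind, by(df[value_cols], labels, function(d) sapply(d, stat)))
}

sds <- lapply(c("id", "status"), function(g) summarise_by(dfex, g, sd, "sd"))
print(sds)
```

Each element of sds is a matrix whose rows are labeled "sd-id-1", "sd-id-2", and so on, or "sd-status-hit" and "sd-status-miss", so the label always reflects both the statistic and the grouping column.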
Source: https://stackoverflow.com/questions/24198629/how-to-create-summaries-of-subgroups-based-on-factors-in-r