ANOVA on multiple responses, by multiple groups NOT part of formula

蓝咒 提交于 2019-12-11 03:40:26

问题


I'm struggling to figure out how to do some ANOVA testing for a collection of responses in multiple environments & sample types. I have a big dataframe, and I want to run these ANOVAs separately by these grouping factors, but they shouldn't be included in the formula.

Maybe I'm missing an obvious way to do this, but so far I've just been subsetting the data is tedious to do..

In SAS, this would be similar to a BY statement within the PROC statement.. it is quite easy to do in SAS.

Here is an example:

data(iris)
data = iris

# suppose the iris data was in two environments..
data$location = unlist(rep(list("US", "USSR"), length(data$Sepal.Length)/2))
data$location = as.factor(data$location)
# and suppose that there was another subgrouping..
data$subgroup = unlist(list(rep(c(rep("A", 25), rep("B", 25)), 3)))
data$subgroup = as.factor(data$subgroup)
# and suppose I only want to look at the differences between two species
somedata = subset(data, Species == "setosa" | Species == "versicolor")
somedata = droplevels(somedata)
# suppose that I want to test if sepal length and sepal width are different..
# between species BY location AND BY subgroup
# and I'm only interested in the pvalues for these comparisons
# in my real data, I have more than just two responses I want to test..

# I could subset all the data.. which is what I've been doing..
# by location
dataUS = subset(somedata, location == "US")
dataUSSR = subset(somedata, location == "USSR")

# then by species
dataUS_A = subset(dataUS, subgroup=="A")
dataUS_B = subset(dataUS, subgroup=="B")

dataUSSR_A = subset(dataUSSR, subgroup=="A")
dataUSSR_B = subset(dataUSSR, subgroup=="B")

t.test(Sepal.Width ~ Species, data=dataUS_A)

Can someone offer a faster way to get me the p values for these comparisons? Possibly with multiple testing control, such as tukey?

I've also tried making the data in long format, and using the subset option, but that is almost as tedious.

I've looked at anova, aov, and a couple other approaches but I've gotten stuck with each. I've also tried making my data wide format and doing something like this:

summary( aov(as.matrix(cbind(somedata[,c(1:2)])) ~ Species*location*subgroup, data=somedata) )

but I still can't figure out how to separate by a group & subgroup

I've also tried concatenating columns into one big "group" column and using that new column as my grouping but that doesn't work either:

somedata$group = do.call(paste, c(somedata[c("location","subgroup")], sep = "_"))

回答1:


Okay, first off, your code is unnecessarily complicated. Check out several simplifications below that will hopefully make things simpler.

Then, to your core question, you want a split-apply-combine strategy. You need to split the data by the relevant grouping variables, then do a t.test in each subset of the data. You can achieve this with a combination of split and lapply. (Per your comment, to obtain results for multiple outcomes, you need nested lapply functions).

# the data, again
data <- iris
data$location <- factor(rep(c("US", "USSR"), length.out = length(data$Sepal.Length)))
data$subgroup <- factor(rep(c(rep("A", 25), rep("B", 25)), 3))
somedata <- data[data$Species %in% c("setosa","versicolor"),]

DVs <- c('Sepal.Width','Sepal.Length','Petal.Length')
out <- lapply(DVs, function(x){
    lapply(split(somedata, list(somedata$location, somedata$subgroup)),
    function(z) {
        t.test(update(~ Species,paste(x,'~.')), data=z)$p.value
     })
})

Here's the result:

> do.call(cbind, setNames(out,DVs))
       Sepal.Width  Sepal.Length Petal.Length
US.A   9.183405e-05 9.371858e-06 5.852323e-14
USSR.A 0.0001211233 0.0001385488 2.97461e-12 
US.B   8.473525e-06 0.0001902751 4.123647e-11
USSR.B 0.001818272  5.308597e-06 7.593105e-11



回答2:


Simplifying your code a bit:

data(iris)
df = iris
df$location = factor(rep(c("US", "USSR"), nrow(df)/2))
df$subgroup = factor(rep(c("A", "B"), each = 25, 3))
df = subset(df, Species != "virginica")

Defining the function to use in by

grouptest = function(dat){
    t.test(Sepal.Width ~ Species, data = dat)
}

Running by location and subgroup:

by(df, df[,c("location", "subgroup")], FUN = grouptest)



回答3:


Have you looked at the by function in R? If that is not enough then you might consider the plyr package (for split, operate, combine).



来源:https://stackoverflow.com/questions/23280491/anova-on-multiple-responses-by-multiple-groups-not-part-of-formula

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!