What is the difference between the aov(depvar~timevar+Error(id))
and the aov(depvar~timevar+Error(id/timevar))
formula specifications? These two varian
The difference between aov(depvar~timevar+Error(id))
and aov(depvar~timevar+Error(id/timevar))
is whether or not you include timevar
as a random effect.
Note that there's more than one way to include a variable as a random effect. You could also use aov(depvar~timevar+Error(id*timevar))
or aov(depvar~timevar+Error(id + timevar))
as well. Each of these means something quite different, but it can be confusing because they'll often give you similar results when applied to the same dataset, due to the constraints of the data themselves.
The slash /
used in aov()
denotes nesting. When you use /
, R automatically expands it to the main effect of the bottom variable plus the interaction between the bottom and the top. For example, A/B
automatically expands to A + A:B
. This is similar to how A*B
automatically expands to A + B + A:B
, but with nesting, the variable in the nest never appears outside of its nest (i.e. there can be no main effect of B
on its own).
You can see this expansion happening in your output:
> summary(aov(var~time1+Error(id / time1)))
Error: id
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1 52.24 52.24
Error: id:time1
Df Sum Sq Mean Sq
time1 1 4291 4291
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
time1 1 1239 1238.7 10.19 0.00167 **
Residuals 176 21399 121.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Error
terms denote random effects. Notice you get one for the main effect of id
because it's the base of the nest, and one for the interaction between id
and time1
, because time1
is nested within id
(you also get an Error
effect for Within
which is the basic residual term for the model, i.e. the random effect of the individual observations themselves).
It depends on 1) how your data are actually structured and 2) what model you intend to run. Note: There's no definitive test you can run on the data to determine the structure or the correct model; this is a thinking exercise rather than a computational one.
In the example models you provided, you have an outcome var
, and then what appear to be grouping variables group
and id
, and then two time variables time1
and time2
. Each id
is only in 1 group, not across both groups, suggesting that id is nested within group.
> table(group, id)
id
group 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0
2 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9 0 9
I'll assume that id
refers to a single participant, and the 9 measurements on time1
and time2
are within-subjects tests on each participant (i.e. each participant was measured 9 times on var
, so this is a repeated measures design).
To make it concrete, let's say var
is a score on some problem solving task, and time1
and time2
are the minutes participants are allowed to study the problem and the amount of time they're given to complete the problem, respectively. Since time1
and time2
are crossed, each participant completes the task 9 times, under each combination of circumstances.
> table(time1, time2)
time2
time1 8 15 20
1 20 20 20
2 20 20 20
3 20 20 20
> table(time1, time2, id)
, , id = 1
time2
time1 8 15 20
1 1 1 1
2 1 1 1
3 1 1 1
, , id = 2
time2
time1 8 15 20
1 1 1 1
2 1 1 1
3 1 1 1
(output truncated)
Participants are tested in groups, with half of the participants in group 1 and the other half in group 2. Perhaps the study was run in classrooms, and group 1 is one class and group 2 is the second class. Probably, group identity is not actually a variable of interest, but we shouldn't leave it out of the model because there may be some nuisance variance resulting from differences between the groups. For example, maybe the first classroom had better lighting, giving all of the members of group 1 and better chance at scoring well on the puzzles than the members of group 2.
Scores, ID, and Group should all be random effects, and time1 and time2 should be fixed effects (note this could vary for the same data if you had different thoughts in the model; e.g. you may want to consider group as fixed depending on your research question).
Given that model, this would be the most complete version of the model, using aov()
:
aov(var~time1*time2 + Error(group/id/(time1*time2)),data=df)
Here's the output:
> summary(aov(var~time1*time2 + Error(group/id/(time1*time2)),data=df))
Error: group
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1 771.7 771.7
Error: group:id
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 18 243.8 13.55
Error: group:id:time1
Df Sum Sq Mean Sq F value Pr(>F)
time1 2 7141 3571 181.6 <2e-16 ***
Residuals 38 747 20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: group:id:time2
Df Sum Sq Mean Sq F value Pr(>F)
time2 2 16353 8176 434.6 <2e-16 ***
Residuals 38 715 19
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: group:id:time1:time2
Df Sum Sq Mean Sq F value Pr(>F)
time1:time2 4 214.5 53.63 5.131 0.00103 **
Residuals 76 794.3 10.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In aov(var ~ time1 * time2 + Error(group/id/(time1 * time2)), data = df) :
Error() model is singular
(Along with the links above, here is some additional guidance on random vs. fixed effects)
Here's a blog post that'll help break down what each means under the "Random Effects in Classical ANOVA" section.
From the blog, here's a summary for what "dividing" in the Error
term means.
aov(Y ~ Error(A), data=d) # Lone random effect
aov(Y ~ B + Error(A/B), data=d) # A random, B fixed, B nested within A
aov(Y ~ (B*X) + Error(A/(B*X)), data=d) # B and X interact within levels of A
So from your question,
aov(depvar~timevar+Error(id/timevar))
means you have a random effect from id
but then fix timevar
with timevar
nested within id
levels versus
aov(depvar~timevar+Error(id))
which is just taking id
as the random effects with no constraint on other variables.
Source: http://conjugateprior.org/2013/01/formulae-in-r-anova/
This might prove useful as well, which is some code going over analysis of variance that has some recommendations on learning ANOVA.