Suppose I have data with a factor variable and a response variable. My questions:
- How do linear regression and mixed-effects models work with factor variables?
- If I fit a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
- Which one is the best model/approach?
As an example, I use the Orthodont data from the nlme package.
library(nlme)
data <- Orthodont
data2 <- subset(data, Sex == "Male")     # males only
data3 <- subset(data, Sex == "Female")   # females only
m1 <- lm(distance ~ age + Sex, data = Orthodont)                 # Sex as a fixed effect
m2 <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)  # random intercept per Sex
m3 <- lm(distance ~ age, data = data2)   # separate fit for males
m4 <- lm(distance ~ age, data = data3)   # separate fit for females
Q1: How do linear regression and mixed-effects models work with factor variables?
A1: Factors are coded as dummy variables (1 = true, 0 = false).
For example, model 1's coefficients are:
coef(m1) #lm( distance ~ age + Sex)
#(Intercept) age SexFemale
# 17.7067130 0.6601852 -2.3210227
Calculating distance is therefore:
Distance = 17.71 + 0.66*age - 2.32*SexFemale
where SexFemale is 0 for males and 1 for females. This simplifies to:
Male: Distance = 17.71 + 0.66*age
Female: Distance = 15.39 + 0.66*age
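The arithmetic above can be checked directly in R. This is a minimal sketch, assuming the nlme package is installed; the names b, female_intercept, new_row and pred0 are introduced here for illustration only.

```r
library(nlme)  # provides the Orthodont data

m1 <- lm(distance ~ age + Sex, data = Orthodont)
b <- coef(m1)

# Female intercept = reference (male) intercept + SexFemale coefficient
female_intercept <- unname(b["(Intercept)"] + b["SexFemale"])

# The same value, obtained by predicting for a female at age 0
new_row <- data.frame(age = 0,
                      Sex = factor("Female", levels = levels(Orthodont$Sex)))
pred0 <- unname(predict(m1, newdata = new_row))

all.equal(female_intercept, pred0)
```

The dummy variable SexFemale only shifts the intercept; the age slope is shared by both sexes.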
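If the factor has more categories (e.g. overweight, healthy, underweight), dummy variables are added accordingly. With R's default treatment coding, one level serves as the reference and is absorbed into the intercept, so a three-level factor contributes two dummy variables:
R code: lm(distance ~ age + weightStatus)
Computations: Distance = b0 + b1*age + b2*weightIsOver + b3*weightIsUnder
Here healthy is taken as the reference level (weightStatus is a hypothetical variable, not part of Orthodont). Each dummy is 0 or 1 depending on an individual's weight type, and its coefficient is the shift in intercept relative to the reference level.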
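To see which dummy columns R actually builds for a multi-level factor, you can inspect the design matrix. This is a small sketch using a made-up weightStatus vector (hypothetical data, not a column of Orthodont) and assuming R's default treatment contrasts.

```r
# Hypothetical three-level factor; weightStatus is not part of Orthodont
weightStatus <- factor(c("healthy", "over", "under", "over", "healthy"))

# Default treatment contrasts: the first level ("healthy") is the reference
# and is represented by the intercept column alone
X <- model.matrix(~ weightStatus)

colnames(X)  # intercept plus one dummy column per non-reference level
X
```

Each row has at most one dummy set to 1; rows for the reference level have both dummies at 0.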
Q2: If I have a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
A2: The slopes and intercepts change depending on your model.
m1 is a multiple linear regression (MLR) in which the intercept changes with sex but the slope for age is the same for both sexes, i.e. two parallel lines. The linear mixed-effects (LME) model m2 also specifies an intercept that varies by sex (1|Sex), but treats that variation as a random effect.
m3 and m4 behave like a model with both varying slopes and varying intercepts, because each sex is fitted on its own subset of the data.
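One way to see what fitting on separate subsets does to the coefficients is to compare m3 and m4 with a single lm that includes an age-by-Sex interaction. This is a sketch, not part of the original models; m_int, dat, male_line and female_line are names introduced here for illustration.

```r
library(nlme)  # provides the Orthodont data

dat <- as.data.frame(Orthodont)  # plain data frame copy

# Single model with separate intercept and slope per sex (hypothetical name)
m_int <- lm(distance ~ age * Sex, data = dat)

# Separate fits, as in m3 and m4
m3 <- lm(distance ~ age, data = subset(dat, Sex == "Male"))
m4 <- lm(distance ~ age, data = subset(dat, Sex == "Female"))

b <- coef(m_int)
male_line   <- c(b[["(Intercept)"]], b[["age"]])
female_line <- c(b[["(Intercept)"]] + b[["SexFemale"]],
                 b[["age"]] + b[["age:SexFemale"]])

all.equal(male_line, unname(coef(m3)))
all.equal(female_line, unname(coef(m4)))
```

The fitted lines coincide; the separate fits differ from the interaction model in that each one estimates its own residual variance.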
Let's specify an LME with random slopes and random intercepts:
# Optimizer changed to "optim" to achieve convergence
m2a <- lme(distance ~ age, data = Orthodont, random = ~ age | Sex,
           control = lmeControl(opt = "optim"))
Combining the coefficients allows us to examine how the models are structured:
# Combine the model coefficients (one row per model and sex)
coefs <- rbind(
  coef(m1)[1:2],                      # m1, male (reference level)
  coef(m1)[1:2] + c(coef(m1)[3], 0),  # m1, female: SexFemale added to intercept
  coef(m2),                           # two rows: male, female
  coef(m2a),                          # two rows: male, female
  coef(m3),
  coef(m4))
names(coefs) <- c("intercept", "age")
model.coefs <- data.frame(
  model = paste0("m", c(1, 1, 2, 2, "2a", "2a", 3, 4)),
  type = rep(c("MLR", "LME randomIntercept", "LME randomSlopes",
               "separate LM"), each = 2),
  Sex = rep(c("male", "female"), 4),
  coefs, row.names = 1:8)
model.coefs
#  model                type    Sex intercept       age   # intercept & slope
#1    m1                 MLR   male  17.70671 0.6601852   # different, same
#2    m1                 MLR female  15.38569 0.6601852
#3    m2 LME randomIntercept   male  17.67197 0.6601852   # different, same
#4    m2 LME randomIntercept female  15.43622 0.6601852
#5   m2a    LME randomSlopes   male  16.65625 0.7540780   # different, different
#6   m2a    LME randomSlopes female  16.91363 0.5236138
#7    m3         separate LM   male  16.34062 0.7843750   # different, different
#8    m4         separate LM female  17.37273 0.4795455
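The per-sex rows for the LME models come from coef(), which for an lme fit combines the fixed effects with each group's random effects. A minimal sketch of that relationship for m2, assuming nlme is installed (rebuilt is a name introduced here):

```r
library(nlme)

m2 <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)

fixef(m2)  # population-level intercept and age slope
ranef(m2)  # per-Sex deviation from the population intercept
coef(m2)   # per-Sex coefficients (one row per level of Sex)

# Rebuild the per-Sex intercepts by hand: fixed part + random part
rebuilt <- fixef(m2)[["(Intercept)"]] + ranef(m2)[["(Intercept)"]]
all.equal(rebuilt, coef(m2)[["(Intercept)"]])
```

Because m2 has no random slope, the age column of coef(m2) is the same for both sexes.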
Q3: Which one is the best model/approach?
A3: It depends on the situation, but probably a mixed-effects model.
In your example, m3 and m4 are fitted independently of each other and therefore inherently have a different slope and intercept for each Sex. The LME models can be compared to determine whether random slopes are warranted (e.g. anova(m2, m2a)). Mixed-effects models are versatile when you have multiple levels (e.g. students within classes within schools) and repeated measures (several measures on the same Subject or across Time). You can also specify covariance structures with these models.
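The model comparison mentioned above can be run as follows. This is a sketch that refits both LME models so it is self-contained; cmp is a name introduced here, and the "optim" optimizer is the one this answer needed for convergence.

```r
library(nlme)

# Random intercept only
m2 <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)

# Random intercept and random slope for age
m2a <- lme(distance ~ age, data = Orthodont, random = ~ age | Sex,
           control = lmeControl(opt = "optim"))

# Both models share the same fixed effects and are fitted by REML,
# so they can be compared on their random-effects structure
cmp <- anova(m2, m2a)
cmp
```

With only two levels of Sex, the variance components are estimated from very little information, so the result of this comparison should be read with caution.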
To visualize these different models, let's look at the Orthodont data:
library(ggplot2)
gg <- ggplot(Orthodont, aes(age, distance, fill = Sex)) + theme_bw() +
  geom_point(shape = 21, position = position_dodge(width = 0.2)) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = "mean", geom = "point", size = 8, shape = 22, colour = "black") +
  scale_fill_manual(values = c("Male" = "black", "Female" = "white"))
Circles = raw data, Squares = means. Distance appears to increase linearly with age. Males have higher distances than females. The slopes may vary by sex too, with females having a smaller increase in distance with age compared to males. (Note: raw data have been slightly dodged on the x-axis to avoid overplotting.)
Adding our models to the data and zooming in:
gg1 <- gg +
geom_abline(data = model.coefs, size=1.5,
aes(slope = age, intercept = intercept, colour = type, linetype = Sex))
gg1 + coord_cartesian(ylim = c(21, 27)) #zoom in
Here, we see the LME model with random intercepts resembles the MLR model. The LME with random intercepts and random slopes resembles the separate LMs on the subsetted data.
Finally, here is how to make the equivalent of m2 using the lme4 package:
m2 <- lme(distance ~ age , data = Orthodont, random = ~ 1|Sex)
library(lme4)
m5 <- lmer(distance ~ age + (1|Sex), data = Orthodont) #same as m2
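To confirm that the two packages agree, the fixed effects of both fits can be compared. This is a sketch assuming both nlme and lme4 are installed; the explicit nlme:: and lme4:: prefixes are only there to make clear which package each extractor comes from.

```r
library(nlme)
library(lme4)

m2 <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)   # nlme
m5 <- lmer(distance ~ age + (1 | Sex), data = Orthodont)          # lme4

nlme::fixef(m2)
lme4::fixef(m5)

# Both use REML by default, so the estimates are expected to agree
# up to optimizer tolerance
all.equal(unname(nlme::fixef(m2)), unname(lme4::fixef(m5)), tolerance = 1e-3)
```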
More resources:
- (Generalized) Linear Mixed Models FAQ
- Comparing nlme and lme4 using Orthodont data
Source: https://stackoverflow.com/questions/36555639/how-do-regression-models-deal-with-the-factor-variables