How do regression models deal with the factor variables?

Question

Suppose I have a data set with a factor and a response variable. My questions:

How do linear regression and mixed effects models work with factor variables?
If I fit a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
Which one is the best model/approach?

As an example, I use the Orthodont data from the nlme package.

library(nlme)
data = Orthodont
data2 <- subset(data, Sex=="Male")
data3 <- subset(data, Sex=="Female")

m1 <- lm (distance ~ age + Sex, data = Orthodont) 
m2 <- lme(distance ~ age , data = Orthodont, random = ~ 1|Sex)

m3 <- lm(distance ~ age, data = data2)
m4 <- lm(distance ~ age, data= data3)

Answer 1:

Q1: How do linear regression and mixed effects models work with factor variables?
A1: Factors are coded as dummy variables (1 = true, 0 = false).
For example, the coefficients of model m1 are:

coef(m1)    #lm( distance ~ age + Sex)
#(Intercept)         age   SexFemale 
# 17.7067130   0.6601852  -2.3210227

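The fitted equation is therefore:
Distance = 17.71 + 0.66*age - 2.32*SexFemale
where SexFemale is 0 for males and 1 for females. Male is the reference level, so it has no coefficient of its own and is absorbed into the intercept. This simplifies to: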
Male: Distance = 17.71 + 0.66*age
Female: Distance = 15.39 + 0.66*age
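
To see the dummy coding directly, one can inspect the design matrix that R builds for this formula. This is a minimal sketch using model.matrix() from base R; the object name X is introduced here only for illustration.

```r
library(nlme)  # provides the Orthodont data

# Design matrix for distance ~ age + Sex
X <- model.matrix(~ age + Sex, data = Orthodont)
colnames(X)    # "(Intercept)" "age" "SexFemale"

# SexFemale is 0 for every male row and 1 for every female row
table(Orthodont$Sex, X[, "SexFemale"])
```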

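If the factor has more levels (e.g. overweight, healthy, underweight), more dummy variables are added, one for each level except the reference level:
R code: lm(distance ~ age + weightStatus)
Computation (taking healthy as the reference level for illustration): Distance = b0 + b1*age + b2*weightIsOver + b3*weightIsUnder
With an intercept in the model, a factor with k levels gets k - 1 dummy variables, so two weight coefficients are estimated here rather than three. Each is multiplied by 0 or 1 depending on an individual's weight type and measures the difference from the reference level, which is absorbed into the intercept.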
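
As a quick check of how R codes a three-level factor, here is a sketch on a made-up factor (the toy data frame and its values are hypothetical and not part of Orthodont). With R's default treatment contrasts and an intercept in the model, the first level is the reference and gets no column of its own, so three levels produce two dummy columns.

```r
# Hypothetical toy data: one row per weight category
toy <- data.frame(
  weightStatus = factor(c("healthy", "overweight", "underweight"))
)

# Default treatment contrasts: the first level ("healthy") is the reference
X3 <- model.matrix(~ weightStatus, data = toy)
colnames(X3)
# "(Intercept)" "weightStatusoverweight" "weightStatusunderweight"

# relevel() changes which level acts as the reference
toy$weightStatus <- relevel(toy$weightStatus, ref = "underweight")
colnames(model.matrix(~ weightStatus, data = toy))
```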

Q2: If I fit a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
A2: The slopes and intercepts change depending on your model.
m1 is a multiple linear regression (MLR) in which the intercept changes with sex but the slope for age is the same, i.e. two parallel lines. Sex is a fixed effect here. The linear mixed effects (LME) model m2 also specifies an intercept that varies by sex (1|Sex), but treats it as a random effect, so m2 is a random-intercepts model.
m3 and m4 are fitted to separate subsets of the data, so both the intercept and the slope differ by sex. This is the fixed-effects analogue of a model with random slopes and random intercepts.
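
The separate fits can also be reproduced with a single lm() call by adding an age-by-Sex interaction. The sketch below (m_int, m_male and m_female are names introduced here for illustration) checks that the per-sex intercepts and slopes implied by the interaction model match the separate regressions. Only the residual variance, and hence the standard errors, are handled differently, because the single model pools the residuals of both sexes.

```r
library(nlme)  # provides the Orthodont data

# One model with sex-specific intercepts and slopes
m_int <- lm(distance ~ age * Sex, data = Orthodont)
b <- coef(m_int)

# Male is the reference level; female values add the Sex terms
male   <- unname(c(b["(Intercept)"], b["age"]))
female <- unname(c(b["(Intercept)"] + b["SexFemale"],
                   b["age"] + b["age:SexFemale"]))

# Separate regressions on each subset, as in m3 and m4
m_male   <- lm(distance ~ age, data = subset(Orthodont, Sex == "Male"))
m_female <- lm(distance ~ age, data = subset(Orthodont, Sex == "Female"))

all.equal(male,   unname(coef(m_male)))
all.equal(female, unname(coef(m_female)))
```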

Let's specify an LME with random slopes and random intercepts:

m2a <- lme(distance ~ age, data = Orthodont, random= ~ age | Sex,
            control = lmeControl(opt="optim"))  
            #Changed the optimizer to achieve convergence

Combining the coefficients allows us to examine how the models are structured:

#Combine the model coefficients
coefs <- rbind(
                coef(m1)[1:2],                     
                coef(m1)[1:2] + c(coef(m1)[3], 0), #female coefficient added to intercept
                coef(m2),
                coef(m2a),
                coef(m3),
                coef(m4)); names(coefs) <- c("intercept", "age")
model.coefs <- data.frame(
                   model = paste0("m", c(1,1,2,2,"2a", "2a",3,4)),
                   type  = rep(c("MLR", "LME randomIntercept", "LME randomSlopes", 
                                  "separate LM"), each=2),
                   Sex = rep(c("male","female"), 4), 
                   coefs, row.names = 1:8)

model.coefs
#  model                type    Sex intercept       age  #intercept & slope 
#1    m1                 MLR   male  17.70671 0.6601852  #different   same 
#2    m1                 MLR female  15.38569 0.6601852  
#3    m2 LME randomIntercept   male  17.67197 0.6601852  #different   same
#4    m2 LME randomIntercept female  15.43622 0.6601852 
#5   m2a    LME randomSlopes   male  16.65625 0.7540780  #different  different
#6   m2a    LME randomSlopes female  16.91363 0.5236138
#7    m3         separate LM   male  16.34062 0.7843750  #different  different
#8    m4         separate LM female  17.37273 0.4795455

Q3: Which one is the best model/approach?
A3: It depends on the situation but probably a mixed effects model.

In your example, m3 and m4 are fitted independently of each other, so they necessarily have different slopes for each Sex. The LME models can be compared to determine whether random slopes are warranted (e.g. anova(m2, m2a)). Mixed effects models are versatile when you have multiple levels (e.g. students within classes within schools) or repeated measures (several measures on the same Subject or across Time). You can also specify covariance structures with these models.
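
The comparison of the two LME models mentioned above could be run as follows. This is only a sketch: both models are refitted so that the block is self-contained, and with just two levels of Sex the variance components rest on very little information, so the resulting test should be read with caution.

```r
library(nlme)

m2  <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)
m2a <- lme(distance ~ age, data = Orthodont, random = ~ age | Sex,
           control = lmeControl(opt = "optim"))

# Both fits use REML with the same fixed effects, so the
# likelihood ratio test compares the random-effects structures
cmp <- anova(m2, m2a)
cmp
```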

To visualize these different models, let's look at the Orthodont data:

library(ggplot2)
gg <- ggplot(Orthodont, aes(age, distance, fill=Sex)) + theme_bw() +
        geom_point(shape=21, position= position_dodge(width=0.2)) +  
        stat_summary(fun = "mean", geom="point", size=8, shape=22, colour="black" ) +  #use fun.y in ggplot2 < 3.3.0
        scale_fill_manual(values = c("Male" = "black", "Female" = "white"))
gg  #print the plot

Circles = raw data, Squares = means. Distance appears to increase linearly with age. Males have higher distances than females. The slopes may vary by sex too, with females having a smaller increase in distance with age compared to males. (Note: raw data have been slightly dodged on the x-axis to avoid overplotting.)

Adding our models to the data and zooming in:

gg1 <- gg +  
            geom_abline(data = model.coefs, size=1.5,
               aes(slope = age, intercept = intercept, colour = type, linetype = Sex)) 
gg1 + coord_cartesian(ylim = c(21, 27)) #zoom in

Here, we see the LME model with random intercepts resembles the MLR model. The LME with random intercepts and random slopes resembles the separate LMs on the subsetted data.

Finally, here is how to make the equivalent of m2 using the lme4 package:

m2 <- lme(distance ~ age , data = Orthodont, random = ~ 1|Sex)
library(lme4)
m5 <- lmer(distance ~ age + (1|Sex), data = Orthodont)  #same as m2
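
To confirm that the two fits agree, their fixed effects can be compared. This is a sketch with explicit namespaces, because both packages provide fixef(); the tolerance is an arbitrary choice that allows for small optimizer differences.

```r
library(nlme)
library(lme4)

m2 <- lme(distance ~ age, data = Orthodont, random = ~ 1 | Sex)
m5 <- lmer(distance ~ age + (1 | Sex), data = Orthodont)

# Both functions default to REML, so the estimates should match closely
fe_nlme <- nlme::fixef(m2)
fe_lme4 <- lme4::fixef(m5)
all.equal(unname(fe_nlme), unname(fe_lme4), tolerance = 1e-4)
```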

More resources:
(Generalized) Linear Mixed Models FAQ
Comparing nlme and lme4 using Orthodont data.

Source: https://stackoverflow.com/questions/36555639/how-do-regression-models-deal-with-the-factor-variables

Tags

regression

mixed-models

nlme