Question
This question arose as a result of another question posted here: non-conformable arguments error from lmer when trying to extract information from the model matrix
When trying to obtain predicted means from an lmer model containing a factor variable, the output varies depending on how the factor variable is specified.
I have a variable agegroup, which can be specified using the groups "Children <15 years", "Adults 15-49 years", "Elderly 50+ years" or "0-15y", "15-49y", "50+y". My choice matters because for the former, the alphabetical ordering of the labels differs from the numeric ordering of the levels. To illustrate this, I have again used the sleep data.
library(lme4)
sleep <- as.data.frame(sleepstudy) #import the sleep data
I have to create a variable for age.
set.seed(13) #set a seed for creating a new variable, age
sleep$age <- sample(1:3,length(sleep),rep=TRUE) #create a new variable, age
sleep$agegroup1 <- factor(sleep$age, levels = c(1,2,3),
labels = c("Children <15 years", "Adults 15-49 years", "Elderly 50+ years"))
table(sleep$agegroup1) #should have 3 age groups
Run the model:
m1 <- lmer(Reaction ~ Days + agegroup1 + Days:agegroup1 + (Days | Subject), sleep)
summary(m1)
# New data frame for predicted means
d <- seq(0,9,1) # make a vector of days = 0 to 9
newdat1 <- data.frame(Days=d,
agegroup1=factor(rep(levels(sleep$agegroup1),length(d))))
newdat1 <- newdat1[order(newdat1$Days,newdat1$agegroup1),] #order by Days
mm <- model.matrix(formula(m1,fixed.only=TRUE)[-2], newdat1) #create the matrix
Now, I try to output the predicted means using the model matrix and also the predict function:
newdat1$mm <- mm%*%fixef(m1)
newdat1$predict <- predict(m1, newdata=newdat1, re.form=NA)
head(newdat1)
Here, the predicted means from the model matrix and the predict function are different; the Adults and Children age groups are inverted.
   Days          agegroup1       mm  predict
11    0 Adults 15-49 years 252.2658 252.8241
1     0 Children <15 years 252.8241 252.2658
21    0  Elderly 50+ years 249.1254 249.1254
2     1 Adults 15-49 years 262.3326 263.2674
22    1 Children <15 years 263.2674 262.3326
12    1  Elderly 50+ years 260.0171 260.0171
If I run this script again using factor labels for which the alphabetical ordering is the same as the numeric ordering of the levels, I get different results:
#set new labels for agegroup
sleep$agegroup2 <- factor(sleep$age, levels = c(1,2,3),
labels = c("0-15y", "15-49y", "50+y"))
m2 <- lmer(Reaction ~ Days + agegroup2 + Days:agegroup2 + (Days | Subject), sleep)
summary(m2)
# New data frame for predicted means
d <- seq(0,9,1) # make a vector of days = 0 to 9
newdat2 <- data.frame(Days=d,
agegroup2=factor(rep(levels(sleep$agegroup2),length(d))))
newdat2 <- newdat2[order(newdat2$Days,newdat2$agegroup2),] #order by Days
mm <- model.matrix(formula(m2,fixed.only=TRUE)[-2], newdat2)
newdat2$mm <- mm%*%fixef(m2)
newdat2$predict <- predict(m2, newdata=newdat2, re.form=NA)
head(newdat2)
Here, the predicted means from the model matrix and the predict function are the same.
   Days agegroup2       mm  predict
1     0     0-15y 252.2658 252.2658
11    0    15-49y 252.8241 252.8241
21    0      50+y 249.1254 249.1254
22    1     0-15y 262.3326 262.3326
2     1    15-49y 263.2674 263.2674
12    1      50+y 260.0171 260.0171
Predict appears to ignore the labels and focus on the levels, while directly accessing the model-matrix correctly focusses on the labels. My question, then, is whether it is always necessary to ensure that factor levels and labels have the same order when trying to use the model matrix? Or is there some other way to overcome this problem?
Answer 1:
The order of the columns of the model matrix must match the order of the fixed effects from the model for the matrix multiplication to give the correct predicted values "by hand". So yes: to use model.matrix and fixef as you did, the order of the factor levels in the new dataset must be the same as in the original dataset.
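A quick way to see the mismatch is to compare the column names of the model matrix with the names of the coefficients before multiplying. The sketch below is only an illustration under assumptions: it uses a plain lm fit on made-up toy data (the names dat, fit, new_bad and mm_bad are hypothetical) so that it needs no extra packages, but the same comparison should apply to fixef() from an lmer fit.

```r
set.seed(1)

# Hypothetical toy data: factor g has a non-alphabetical level order (c, a, b)
dat <- data.frame(
  x = rep(0:4, times = 6),
  g = factor(rep(c("c", "a", "b"), each = 10), levels = c("c", "a", "b"))
)
dat$y <- 1 + 2 * dat$x + as.integer(dat$g) + rnorm(nrow(dat))
fit <- lm(y ~ x + g, data = dat)

# New data whose factor is rebuilt from strings only:
# factor() sorts the levels alphabetically (a, b, c)
new_bad <- data.frame(x = 0, g = factor(c("c", "a", "b")))
mm_bad <- model.matrix(~ x + g, new_bad)

# The two name vectors differ, so mm_bad %*% coef(fit) would run
# without error but pair coefficients with the wrong groups
names(coef(fit))
colnames(mm_bad)
identical(colnames(mm_bad), names(coef(fit)))
```

Because both objects have the same number of columns, R raises no error for the product, which is why the check on names is worth doing explicitly.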
You can achieve this by setting the order of the factor levels in your new dataset. The easiest way is to reuse the levels of the factor from the original dataset. For example, in newdat1 you can do:
factor(rep(levels(sleep$agegroup1), length(d)), levels = levels(sleep$agegroup1))
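Put together, the fix might look like the sketch below. It follows the question's setup with two assumptions of mine: age is sampled once per row with nrow(sleep), so the fitted numbers will not match those printed in the question, and the helper name lev is introduced only for readability.

```r
library(lme4)

sleep <- as.data.frame(sleepstudy)
set.seed(13)
# Assumption: one sampled age code per row (nrow), not per column
sleep$age <- sample(1:3, nrow(sleep), replace = TRUE)
sleep$agegroup1 <- factor(sleep$age, levels = c(1, 2, 3),
                          labels = c("Children <15 years", "Adults 15-49 years",
                                     "Elderly 50+ years"))

m1 <- lmer(Reaction ~ Days + agegroup1 + Days:agegroup1 + (Days | Subject), sleep)

# New data: reuse the original level order instead of letting factor() sort
d <- 0:9
lev <- levels(sleep$agegroup1)
newdat1 <- data.frame(
  Days = rep(d, each = length(lev)),
  agegroup1 = factor(rep(lev, times = length(d)), levels = lev)
)

mm <- model.matrix(formula(m1, fixed.only = TRUE)[-2], newdat1)

# Guard: stop if the columns do not line up with the fixed effects
stopifnot(identical(colnames(mm), names(fixef(m1))))

newdat1$mm <- as.vector(mm %*% fixef(m1))
newdat1$predict <- predict(m1, newdata = newdat1, re.form = NA)

# With matching level order, the two columns should agree
all.equal(newdat1$mm, unname(newdat1$predict))
```

The stopifnot guard is optional, but it turns a silent mislabelling into an immediate error if the level order ever drifts again.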
Source: https://stackoverflow.com/questions/34346755/predict-and-model-matrix-give-different-predicted-means-within-levels-of-a-facto