Why do I get NA coefficients and how does `lm` drop reference level for interaction

感情迁移 提交于 2019-12-01 05:26:40
李哲源

R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NAs in the results.

There is no causality between those two. You get NA purely because your variables have nesting.

R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (year1) in the main effect and interactions. How to force R to omit year1 and year1:treatment1 from the preceding model?

There is no inconsistency but model.matrix has its own rule. You get seemingly "inconsistent" contrasts because you don't have main effect treatment but only interaction treatment:year.

In the following, I will first explain NA coefficients, then the contrasts for interaction, and finally give some suggestions.


NA coefficients

By default, contrast treatment is used for contrasting factor, and by default of contr.treatement, the first factor level is taken as reference level. Have a look at your data:

levels(df$id)
# [1] "1" "2" "3" "4" "5"

levels(df$year)
# [1] "1" "2"

levels(df$treatment)
# [1] "0" "1"

Now take a look at a simple linear model:

lm(y ~ id + year + treatment, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5        year2  
#      2.153        1.651        1.773        2.696        3.541        1.094  
# treatment1  
#         NA  

You can see that id1, year1 and treatment0 are not there, as they are taken as reference.

Even without interaction, you already have an NA coefficient. This implies that treatment is nested in span{id, year}. When you further include treatment:year, such nesting still exists.

In fact, a further test shows that treatment is nested in id:

lm(y ~ id + treatment, df)

#    Coefficients:
#(Intercept)          id2          id3          id4          id5   treatment1  
#      2.700        1.651        1.773        2.696        3.541           NA

In summary, variable treatment is completely redundant for your modelling purpose. If you include id, then there is no need to include treatment or treatment:* where * can be any variable.

It is very easy to get nesting when you have many factor variables in a regression model. Note, contrasts will not necessarily remove nesting, as contrast only recognises variable name, but not potential numerical feature. Take the following example for how to cheat contr.treatment:

df$ID <- df$id
lm(y ~ id + ID, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5          ID2  
#      2.700        1.651        1.773        2.696        3.541           NA  
#        ID3          ID4          ID5  
#         NA           NA           NA  

Look, contrasts works as expected, but ID is nested in id so we have rank-deficiency.


Contrasts for interaction

We first remove the noise imposed by NA, by dropping id variable. Then, a regression model with treatment and year will be full-rank, so no NA should be seen if contrasts is successful.

Contrasts for interaction, or high-order effects, depends on the presence of low-order effects. Compare the following models:

lm(y ~ year:treatment, df)  ## no low-order effects at all

#Coefficients:
#     (Intercept)  treatment0:year1  treatment1:year1  treatment0:year2  
#          5.4523           -1.3976           -1.3466           -0.6826  
#treatment1:year2  
#              NA  

lm(y ~ year + treatment + year:treatment, df)  ## with all low-order effects

#Coefficients:
#     (Intercept)        treatment1             year2  treatment1:year2  
#         4.05471           0.05094           0.71493           0.63170  

In the first model, no contrasts is done, because there is no presence of main effects, but only the 2nd order effect. The NA here is due to the absence of contrasts.

Now consider another set of examples, by including some but not all main effects:

lm(y ~ year + year:treatment, df)  ## main effect `treatment` is missing

#Coefficients:
#     (Intercept)             year2  year1:treatment1  year2:treatment1  
#         4.05471           0.71493           0.05094           0.68264  

lm(y ~ treatment + year:treatment, df)  ## main effect `year` is missing

#Coefficients:
#     (Intercept)        treatment1  treatment0:year2  treatment1:year2  
#         4.05471           0.05094           0.71493           1.34663  

Here, contrasts for interaction will happen to the variable whose main effect is missing. For example, in the first model, main effect treatment is missing so interaction drops treatement0; while in the second model, main effect year is missing so interaction drops year1.


A general guideline

Always include all low-order effects when specifying high-order effects. This not only gives easy-to-understand contrasts behaviour, but also has some other appealing statistical reason. You can read Including the interaction but not the main effects in a model on Cross Validated.

Another suggestion, is to always include intercept. In linear regression, a model with intercept yields unbiased estimate, and residuals will have 0 mean.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!