I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:

df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"), 
    year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
    2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"), 
    treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"), 
    y = c(1.4068142116718, 2.67041187927052, 2.69166439169131, 
    3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572, 
    5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286, 
    3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795, 
    5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801, 
    5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279, 
    5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312, 
    8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id", 
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")


lm(y ~ -1 + id + year + year:treatment, df)

#Coefficients:
#             id1               id2               id3               id4  
#          2.6585            3.9933            4.1161            5.3544  
#             id5             year2  year1:treatment1  year2:treatment1  
#          6.1991            0.7149           -0.6317                NA

R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NA's in the results.

Also, R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (year1) in the main effect and interactions. How to force R to omit year1 and year1:treatment1 from the preceding model?

I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.

李哲源

R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NAs in the results.

There is no causality between those two. You get NA purely because your variables have nesting.

R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (year1) in the main effect and interactions. How to force R to omit year1 and year1:treatment1 from the preceding model?

There is no inconsistency but model.matrix has its own rule. You get seemingly "inconsistent" contrasts because you don't have main effect treatment but only interaction treatment:year.

In the following, I will first explain NA coefficients, then the contrasts for interaction, and finally give some suggestions.

`NA` coefficients

By default, contrast treatment is used for contrasting factor, and by default of contr.treatement, the first factor level is taken as reference level. Have a look at your data:

levels(df$id)
# [1] "1" "2" "3" "4" "5"

levels(df$year)
# [1] "1" "2"

levels(df$treatment)
# [1] "0" "1"

Now take a look at a simple linear model:

lm(y ~ id + year + treatment, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5        year2  
#      2.153        1.651        1.773        2.696        3.541        1.094  
# treatment1  
#         NA

You can see that id1, year1 and treatment0 are not there, as they are taken as reference.

Even without interaction, you already have an NA coefficient. This implies that treatment is nested in span{id, year}. When you further include treatment:year, such nesting still exists.

In fact, a further test shows that treatment is nested in id:

lm(y ~ id + treatment, df)

#    Coefficients:
#(Intercept)          id2          id3          id4          id5   treatment1  
#      2.700        1.651        1.773        2.696        3.541           NA

In summary, variable treatment is completely redundant for your modelling purpose. If you include id, then there is no need to include treatment or treatment:* where * can be any variable.

It is very easy to get nesting when you have many factor variables in a regression model. Note, contrasts will not necessarily remove nesting, as contrast only recognises variable name, but not potential numerical feature. Take the following example for how to cheat contr.treatment:

df$ID <- df$id
lm(y ~ id + ID, df)

#Coefficients:
#(Intercept)          id2          id3          id4          id5          ID2  
#      2.700        1.651        1.773        2.696        3.541           NA  
#        ID3          ID4          ID5  
#         NA           NA           NA

Look, contrasts works as expected, but ID is nested in id so we have rank-deficiency.

Contrasts for interaction

We first remove the noise imposed by NA, by dropping id variable. Then, a regression model with treatment and year will be full-rank, so no NA should be seen if contrasts is successful.

Contrasts for interaction, or high-order effects, depends on the presence of low-order effects. Compare the following models:

lm(y ~ year:treatment, df)  ## no low-order effects at all

#Coefficients:
#     (Intercept)  treatment0:year1  treatment1:year1  treatment0:year2  
#          5.4523           -1.3976           -1.3466           -0.6826  
#treatment1:year2  
#              NA  

lm(y ~ year + treatment + year:treatment, df)  ## with all low-order effects

#Coefficients:
#     (Intercept)        treatment1             year2  treatment1:year2  
#         4.05471           0.05094           0.71493           0.63170

In the first model, no contrasts is done, because there is no presence of main effects, but only the 2nd order effect. The NA here is due to the absence of contrasts.

Now consider another set of examples, by including some but not all main effects:

lm(y ~ year + year:treatment, df)  ## main effect `treatment` is missing

#Coefficients:
#     (Intercept)             year2  year1:treatment1  year2:treatment1  
#         4.05471           0.71493           0.05094           0.68264  

lm(y ~ treatment + year:treatment, df)  ## main effect `year` is missing

#Coefficients:
#     (Intercept)        treatment1  treatment0:year2  treatment1:year2  
#         4.05471           0.05094           0.71493           1.34663

Here, contrasts for interaction will happen to the variable whose main effect is missing. For example, in the first model, main effect treatment is missing so interaction drops treatement0; while in the second model, main effect year is missing so interaction drops year1.

A general guideline

Always include all low-order effects when specifying high-order effects. This not only gives easy-to-understand contrasts behaviour, but also has some other appealing statistical reason. You can read Including the interaction but not the main effects in a model on Cross Validated.

Another suggestion, is to always include intercept. In linear regression, a model with intercept yields unbiased estimate, and residuals will have 0 mean.

来源：https://stackoverflow.com/questions/40723196/why-do-i-get-na-coefficients-and-how-does-lm-drop-reference-level-for-interact

标签

regression

linear-regression

Why do I get NA coefficients and how does `lm` drop reference level for interaction

NA coefficients

Contrasts for interaction

A general guideline

`NA` coefficients