Question
I was running a regression using categorical variables and came across this question. There, the user wanted to add a separate column for each dummy. This left me quite confused, because I thought that having long data, with all the dummies stored in a single column converted via as.factor(), was equivalent to having explicit dummy variables.
Could someone explain the difference between the following two linear regression models?
Linear Model 1, where Month is a factor:
dt_long
Sales Period Month
1: 0.4898943 1 M1
2: 0.3097716 1 M1
3: 1.0574771 1 M1
4: 0.5121627 1 M1
5: 0.6650744 1 M1
---
8108: 0.5175480 24 M12
8109: 1.2867316 24 M12
8110: 0.6283875 24 M12
8111: 0.6287151 24 M12
8112: 0.4347708 24 M12
M1 <- lm(data = dt_long,
         formula = Sales ~ Period + factor(Month))
Linear Model 2, where each month is an indicator variable:
dt_wide
Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
1: 0.4898943 1 1 0 0 0 0 0 0 0 0 0 0 0
2: 0.3097716 1 1 0 0 0 0 0 0 0 0 0 0 0
3: 1.0574771 1 1 0 0 0 0 0 0 0 0 0 0 0
4: 0.5121627 1 1 0 0 0 0 0 0 0 0 0 0 0
5: 0.6650744 1 1 0 0 0 0 0 0 0 0 0 0 0
---
8108: 0.5175480 24 0 0 0 0 0 0 0 0 0 0 0 1
8109: 1.2867316 24 0 0 0 0 0 0 0 0 0 0 0 1
8110: 0.6283875 24 0 0 0 0 0 0 0 0 0 0 0 1
8111: 0.6287151 24 0 0 0 0 0 0 0 0 0 0 0 1
8112: 0.4347708 24 0 0 0 0 0 0 0 0 0 0 0 1
M2 <- lm(data = dt_wide,
         formula = Sales ~ Period + M1 + M2 + M3 + M4 + M5 + M6 +
                   M7 + M8 + M9 + M10 + M11 + M12)
Judging by this previously asked question, both models seem to be exactly the same. However, after running both, I noticed that model M1 returns 11 dummy estimates (month M1 is absorbed as the reference level), while model M2 returns 12 dummies.
Is one model better than the other? Is M1 more efficient? Can I set the reference level in M1 to make both models exactly equivalent?
Answer 1:
Defining a model as in M1 is just a shortcut for including dummy variables: if you wanted to compute those regression coefficients by hand, clearly the regressors would have to be numeric.
Now something that perhaps you didn't notice about M2 is that one of the dummies gets an NA coefficient. That is because you manually included all twelve dummies and kept the intercept, so there is perfect collinearity: the intercept column equals the sum of the dummy columns. Dropping one of the dummies, or adding -1 to the formula to eliminate the constant term, makes everything fine.
Some examples. Let
y <- rnorm(100)
x0 <- rep(1:0, each = 50)
x1 <- rep(0:1, each = 50)
x <- factor(x1)
In this way, x0 and x1 are a decomposition of x. Then
## Too much
lm(y ~ x0 + x1)
# Call:
# lm(formula = y ~ x0 + x1)
# Coefficients:
# (Intercept) x0 x1
# -0.15044 0.07561 NA
## One way to fix it
lm(y ~ x0 + x1 - 1)
# Call:
# lm(formula = y ~ x0 + x1 - 1)
# Coefficients:
# x0 x1
# -0.07483 -0.15044
## Another one
lm(y ~ x1)
# Call:
# lm(formula = y ~ x1)
# Coefficients:
# (Intercept) x1
# -0.07483 -0.07561
## The same results
lm(y ~ x)
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept) x1
# -0.07483 -0.07561
Ultimately, all the models contain the same information; under perfect multicollinearity what we face is an identification problem, not a loss of information.
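To make the identification point concrete, here is a small check (a sketch reusing the toy data above; relevel() picks which factor level is the baseline, which also answers the reference-level question): every identified parameterization yields identical fitted values, only the coefficient labels change.
## All identified parameterizations give the same fitted values
all.equal(fitted(lm(y ~ x)), fitted(lm(y ~ x0 + x1 - 1)))  # TRUE
all.equal(fitted(lm(y ~ x)), fitted(lm(y ~ x1)))           # TRUE
## Changing the reference level only relabels the coefficients
x_ref <- relevel(x, ref = "1")
all.equal(fitted(lm(y ~ x)), fitted(lm(y ~ x_ref)))        # TRUE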
Answer 2:
- Improper dummy coding.
When you convert a categorical variable into dummy variables, you end up with one fewer dummy variable than you had categories. That's because the last category is already indicated by a 0 on all the other dummy variables; including a dummy for it adds only redundant information, producing multicollinearity. So always check your dummy coding if it seems you've got a multicollinearity problem (see the sketch below).
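As a quick illustration (a sketch, not part of the original answer), model.matrix() shows the coding R applies to a factor: k categories become k - 1 indicator columns plus an intercept, with the first level serving as the baseline.
m <- factor(c("M1", "M2", "M3"))
model.matrix(~ m)  # attributes omitted from the output below
#   (Intercept) mM2 mM3
# 1           1   0   0
# 2           1   1   0
# 3           1   0   1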
Source: https://stackoverflow.com/questions/54471095/difference-between-categorical-variables-factors-and-dummy-variables