问题
I'm trying to estimate factors that determine the difference in happiness level between people living in New York and Chicago.
Data looks like below.
Happiness City Gender Employment Worktype Holiday
1 60 New York 0 0 Unemployed Unemployed
2 80 Chicago 1 1 Whitecolor 1 day a week
3 39 Chicago 0 0 Unemployed Unemployed
4 40 New York 1 0 Unemployed Unemployed
5 69 Chicago 1 1 Bluecolor 2 day a week
6 90 Chicago 1 1 Bluecolor 2 day a week
7 100 New York 0 1 Whitecolor 2 day a week
8 30 New York 1 1 Whitecolor 1 day a week
Happiness level is dependent variable, and 'city' is where the person lives. 'Gender' is coded 0 = man 1 = woman. 'Employment' is 0 = Unemployed and 1 = Employed. 'Worktype' is three level factor: 'Unemployed', 'Whitecolor', 'Bluecolor'. 'Holiday' is how many days a person rest in a week. Here 'City', 'Gender', 'Worktype' and 'Holiday' variables are all factors. 'Happiness' and 'Employment' variable types are numerical.
The Model I want to estimate is
lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
I left 'Employment' value as numerical value so if 'Employment' is equal to 0(Unemployed), 0:(Worktype + Holiday) = 0, so the model is automatically reduced to
lm(Happiness ~ City + Gender)
for unemployed people.
However, regression result returns NA values.
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.75 23.56 2.408 0.138
CityNew York -14.50 27.21 -0.533 0.647
Gender1 -2.25 35.99 -0.063 0.956
Employment:WorktypeBluecolor 25.00 43.02 0.581 0.620
Employment:WorktypeUnemployed NA NA NA NA
Employment:WorktypeWhitecolor 57.75 35.99 1.604 0.250
Employment:Holiday1 day a week -50.00 54.42 -0.919 0.455
Employment:Holiday2 day a week NA NA NA NA
this seems to be due to 'Unemployment' value in 'Worktype' and 'Holiday' variable. However, I am not sure why R is not treating Employment:WorktypeUnemployed which is obviously 0:Worktype = 0 as zero and not removing it from the model. Is this because R is setting Employment:HolidayUnemployed as a baseline and both are perfectly multicollinear? (I had to put 'Unemployed' value for 'Worktype' and 'Holiday' because I wanted to see the effect of 'Worktype' and 'Holiday' compared to 'Unemployed' people. If I remove 'Unemployed' value NA disappears, but baseline will be 'Whitecolor' and '1day a week' so I cannot see the effect compared to 'unemployed'.)
If so, Why am I getting NA for coefficients for 'Employement:Holiday2 day a week'? It seems that it has nothing to do with 'Unemployed' value.
Can I rely on this result while just removing NA coefficients?
below are reproducible code.
Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)
City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
"Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0,1, 0, 0, 1 ,1 , 1 , 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",
"Unemployed", "Bluecolor", "Bluecolor", "Whitecolor","Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)
head(data,8)
str(data)
reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)
回答1:
You shouldn't worry about the NA values for Employment:WorktypeUnemployed
. R tries automatically to compute all the interactions, but that particular coefficient remains undetermined because, clearly, it is never the case that Employment=1 and Worktype="Unemployed". It does not have any effect on the computations of the other coefficients: you can verify by manually coding the dummy variables:
> library(lme4) # for the convenient "dummy" function
> data <- data.frame(data,
+ dummy(Worktype, c("Bluecolor","Whitecolor")),
+ h1=dummy(Holiday)[,1],
+ h2=dummy(Holiday)[,2])
>
> reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor + Employment:h1 + Employment:h2 , data)
> summary(reg)
Call:
lm(formula = Happiness ~ City + Gender + Employment:Bluecolor +
Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)
Residuals:
1 2 3 4 5 6 7 8
1.775e+01 1.775e+01 -1.775e+01 8.882e-16 -1.050e+01 1.050e+01 4.441e-15 -1.775e+01
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.75 23.56 2.408 0.138
CityNew York -14.50 27.21 -0.533 0.647
Gender1 -2.25 35.99 -0.063 0.956
Employment:Bluecolor 25.00 43.02 0.581 0.620
Employment:Whitecolor 57.75 35.99 1.604 0.250
Employment:h1 -50.00 54.42 -0.919 0.455
Employment:h2 NA NA NA NA
Residual standard error: 27.21 on 2 degrees of freedom
Multiple R-squared: 0.6798, Adjusted R-squared: -0.1208
F-statistic: 0.8491 on 5 and 2 DF, p-value: 0.619
The estimated coefficients are identical even though Employment:WorktypeUnemployed
is not present anymore.
However, the NA values are still present for Employment:h2
(equivalent to Employment:Holiday2 day a week
). This seems due to the fact that in this reduced dataset you end up with a singular model matrix (i.e. one column is a linear combination of other columns)
> solve(crossprod(model.matrix(reg)))
Error in solve.default(crossprod(model.matrix(reg))) :
system is computationally singular: reciprocal condition number = 1.79897e-18
So this issue may not be present with a larger dataset. Eventually, you could try to drop any redundancy in the model (e.g., are there any employed with 0 days per week of holiday? if not then 1 day should be the baseline, and you would add extra columns to code for days of holiday >1). You can use the alias()
function to check which term is giving the issue.
来源:https://stackoverflow.com/questions/47976109/na-values-when-regressing-with-dummy-variable-interaction-term