How to conditionally drop NA observations of factors when doing linear regression in R?

Submitted 2019-12-12 15:23:04

Question


I'm trying to fit a simple linear regression model in R.

There are three factor variables in the model.

The model is

lm(Exercise ~ Econ + Job + Position)

where "Exercise" is the numeric dependent variable: the amount of time spent exercising.

"Econ", "Job", "Position" are all factor variables.

"Econ" is whether a person is employed or not. (levels = employed / not employed)

"Job" is the job type a person has. There are five levels for this variable.

"Position" is the position a person has in the workplace. There are five levels for this variable also.

I tried to fit the linear regression and got an error:

"contrasts can be applied only to factors with 2 or more levels"

I think this error is due to NAs in the factor variables: whenever "Econ" is 'unemployed', "Job" and "Position" are NA (obviously, unemployed people have no job type or job position).

If I fit the two models below separately, no error occurs.

lm(Exercise ~ Econ)

lm(Exercise ~ Job + Position)

However, I want one model that automatically uses variables as needed, and one result table. So if "Econ" is 'employed', the "Job" and "Position" variables are used in the regression; if "Econ" is 'unemployed', "Job" and "Position" are automatically dropped from the model.

The reason I want one model instead of two is that by putting all the variables in one model, I can see the effect of "Econ" (employed or unemployed) in addition to the effects of "Job" and "Position" among people who are employed.

If I just regress

lm(Exercise ~ Job + Position)

I do not know the effect of employment.

I thought of a solution: recode all NA values of "Job" and "Position" to a 0 = 'unemployed' level. But I am not sure this solves the problem, and I suspect it might lead to a multicollinearity problem.

Is there any way to automatically/conditionally drop NA observations according to some other factor variable?

Below is my reproducible example.

    Exercise <- c(50, 30, 25, 44, 32, 50, 22, 14)
    Econ <- as.factor(c(1, 0, 1, 1, 0, 0, 1, 1))
    # 0 = unemployed, 1 = employed

    Job <- as.factor(c("A", NA, "B", "B", NA, NA, "A", "C"))

    Position <- as.factor(c("Owner", NA, "Employee", "Owner",
                            NA, NA, "Employee", "Director"))

    data <- data.frame(Exercise, Econ, Job, Position)

    str(data)

    lm(Exercise ~ Econ + Job + Position)

    lm(Exercise ~ Econ)

    lm(Exercise ~ Job + Position)

What I want here is the first model, lm(Exercise ~ Econ + Job + Position), but I get an error because for every row with Econ = 0 (unemployed), Job and Position are NA.
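One way to act on the recoding idea from the question (this is a sketch, not part of the accepted answer) is to make "not applicable" an explicit factor level instead of NA, using base R's addNA(). The caveat is exactly the multicollinearity the question anticipates: once Job and Position carry a "None" level, Econ is perfectly collinear with Job == "None", so one of the redundant terms will come back as an NA coefficient.

```r
# Sketch: turn NA into an explicit "None" level so all 8 rows enter the fit.
# The "None" level names are my own choice, not from the original post.
Job2      <- addNA(Job)        # adds NA as a real factor level
Position2 <- addNA(Position)
levels(Job2)[is.na(levels(Job2))]           <- "None"
levels(Position2)[is.na(levels(Position2))] <- "None"

lm(Exercise ~ Job2 + Position2)  # runs on all 8 observations
```

Note that Econ is deliberately left out here: the Job2 == "None" dummy already encodes "unemployed", so adding Econ would only reintroduce the redundancy.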


Answer 1:


If you really just want the first model to run without errors (keeping the same missing-value handling you are using now), you could do this:

lm(Exercise ~ as.integer(Econ) + Job + Position)

Note that all you have really done is reproduce the result of the third model.

lm(Exercise ~ Job + Position) # third model
lm(Exercise ~ as.integer(Econ) + Job + Position) # first model

coef(lm(Exercise ~ Job + Position))
coef(lm(Exercise ~ as.integer(Econ) + Job + Position))

Unless you change how you handle missing values, the first model you want, lm(Exercise ~ Econ + Job + Position), is equivalent to the third model, lm(Exercise ~ Job + Position). Here is why.

By default, na.action = na.omit within the lm function. This means that any row with a missing value in any predictor or in the response is dropped. There are multiple ways to see this. One is to apply model.matrix, which is what lm does under the hood:

model.matrix(Exercise ~ Econ + Job + Position)
  (Intercept) Econ1 JobB JobC PositionEmployee PositionOwner
1           1     1    0    0                0             1
3           1     1    1    0                1             0
4           1     1    1    0                0             1
7           1     1    0    0                1             0
8           1     1    0    1                0             0
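Another quick check (a sketch, reusing the data frame from the question) is to count complete cases and compare that with the number of observations lm actually uses:

```r
# Rows with any NA among the model variables are dropped before fitting
complete.cases(data)       # FALSE exactly for the Econ = 0 rows
sum(complete.cases(data))  # 5 of the 8 rows remain
nobs(lm(Exercise ~ Job + Position, data = data))  # also 5
```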

As you already correctly pointed out, Econ = 0 is perfectly aligned with Position = NA. Thus, lm drops those observations, and Econ ends up with a single value; lm does not know how to handle a factor with a single level. I bypassed this error by using as.integer(), but you still end up with a predictor that takes only a single value.

Next, lm silently drops such predictors, which is why you get an NA for the coefficient on as.integer(Econ). This is because the default is singular.ok = TRUE.

If you set singular.ok = FALSE instead, you get an error that is basically saying you are trying to fit a model in which a predictor takes only a single value:

lm(Exercise ~ as.integer(Econ) + Job + Position, singular.ok = FALSE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  singular fit encountered
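To make the equivalence of the two models concrete, here is a small check (a sketch, using the question's data) that compares the coefficients the models share; the constant as.integer(Econ) column is dropped, so the remaining terms come out identical:

```r
m1 <- lm(Exercise ~ as.integer(Econ) + Job + Position, data = data)
m3 <- lm(Exercise ~ Job + Position, data = data)

# compare only the terms the two models have in common
shared <- intersect(names(coef(m1)), names(coef(m3)))
all.equal(coef(m1)[shared], coef(m3)[shared])  # the fits agree term by term
```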


Source: https://stackoverflow.com/questions/47493639/how-to-drop-na-observation-of-factors-conditionally-when-doing-linear-regression
