Question
I'm trying to fit a model to my data, but I get the message "Coefficients: (3 not defined because of singularities)". These occur for winter, large and high_flow.
I found this: https://stats.stackexchange.com/questions/13465/how-to-deal-with-an-error-such-as-coefficients-14-not-defined-because-of-singu
which said the cause may be incorrect dummy variables, but I've checked that none of my columns are duplicates.
When I use the function alias() I get:
Model :
S ~ A + B + C + D + E + F + G + spring + summer + autumn + winter + small + medium + large + low_flow + med_flow + high_flow
Complete :
          (Intercept) A B C D E F G spring summer autumn small medium low_flow med_flow
winter              1 0 0 0 0 0 0 0     -1     -1     -1     0      0        0        0
large               1 0 0 0 0 0 0 0      0      0      0    -1     -1        0        0
high_flow           1 0 0 0 0 0 0 0      0      0      0     0      0       -1       -1
Columns A-H of my data contain numeric values; the remaining columns take 0 or 1, and I have checked that there are no conflicting values (i.e. if spring = 1 for a case, then autumn = summer = winter = 0).
model_1 <- lm(S ~ A+B+C+D+E+F+G+spring+summer+autumn+winter+small+medium+large+low_flow+med_flow+high_flow, data = trainOne)
summary(model_1)
Can someone explain the error please?
EDIT: example of my data before I changed it to binary
season size flow A B C D E F G S
spring small medium 52 72 134 48 114 114 142 11
autumn small medium 43 21 98 165 108 23 60 31
spring medium medium 41 45 161 86 177 145 32 12
autumn large medium 40 86 132 80 82 138 186 16
winter medium high 49 32 147 189 125 43 144 67
summer large high 43 9 158 64 14 146 15 71
Answer 1:
@JuliusVainora has already given you a good explanation of how the error occurs, which I will not repeat. However, Julius' answer is only one method, and it might not be satisfying if you don't understand that there really is a value for cases where winter = 1, large = 1 and high_flow = 1. It can readily be seen in the display as the value for "(Intercept)". You might be able to make the result more interpretable by adding +0 to your formula. (Or it might not, depending on the data situation.)
However, I think that you really should re-examine how your categorical variables are coded. You are using a method of one dummy variable per level that you are copying from some other system, perhaps SAS or SPSS? That will predictably cause problems for you in the future, as well as being painful to code and maintain. R already encodes multiple levels in a single variable as a factor: the data.frame function creates factors from character columns automatically when stringsAsFactors = TRUE (the default before R 4.0.0), and lm converts character columns to factors when it fits the model. (Read ?factor.) So your formula would become:
S ~ A + B + C + D + E + F + G + season + size + flow
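As a minimal sketch of the factor-based approach, the block below fits the model on made-up data. The data frame `toy`, its values, and the reduced set of numeric predictors (only A and B) are assumptions for illustration and are not the asker's `trainOne`.

```r
# Hypothetical toy data: every combination of the three categorical variables.
set.seed(1)
toy <- expand.grid(
  season = c("spring", "summer", "autumn", "winter"),
  size   = c("small", "medium", "large"),
  flow   = c("low", "medium", "high"),
  stringsAsFactors = TRUE   # keep the categorical columns as factors
)
toy$A <- rnorm(nrow(toy))
toy$B <- rnorm(nrow(toy))
toy$S <- 10 + 2 * toy$A - toy$B + rnorm(nrow(toy))

# lm() expands each factor into (number of levels - 1) dummy columns using
# treatment contrasts, so the dummies no longer sum to the intercept column.
fit <- lm(S ~ A + B + season + size + flow, data = toy)

# 1 intercept + 2 numeric + 3 season + 2 size + 2 flow coefficients.
length(coef(fit))

# FALSE means no coefficient was dropped because of a singularity.
anyNA(coef(fit))
```

The level that gets no coefficient of its own (the first level of each factor, alphabetically by default) is absorbed into the intercept; `relevel()` can change which level that is.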
Answer 2:
The issue is perfect collinearity. Namely,
spring + summer + autumn + winter == 1
small + medium + large == 1
low_flow + med_flow + high_flow == 1
Constant term == 1
By this I mean that those identities hold for each observation individually. (E.g., only one of the seasons is equal to one.)
So, for instance, lm cannot distinguish between the intercept and the sum of all the seasons' effects. Perhaps this or this will help to get the idea better. More technically, the OLS estimates involve a certain matrix (X'X, where X is the design matrix) that is not invertible in this case.
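The identities above can be checked numerically. The following sketch uses a made-up season variable (an assumption, not the asker's data) to show that a full set of 0/1 columns plus an intercept gives a rank-deficient design matrix, and that lm reports NA for the redundant column.

```r
# Hypothetical toy data: one 0/1 column per season level, as in the question.
set.seed(1)
season  <- factor(rep(c("spring", "summer", "autumn", "winter"), times = 5))
dummies <- sapply(levels(season), function(lv) as.integer(season == lv))

# Exactly one season is set in every row, so the four columns sum to 1,
# which is the same as the intercept column.
all(rowSums(dummies) == 1)

# Design matrix with an intercept and all four dummies: 5 columns, rank 4.
X <- cbind(Intercept = 1, dummies)
ncol(X)
qr(X)$rank

# X'X inherits the rank deficiency, so it has no unique inverse.
XtX <- crossprod(X)
qr(XtX)$rank < ncol(XtX)

# lm() copes by leaving one coefficient undefined (NA).
S   <- rnorm(nrow(dummies))
fit <- lm(S ~ dummies)
sum(is.na(coef(fit)))
```

Each additional complete set of dummies (size, flow) adds one more redundant column, which matches the three undefined coefficients in the question.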
To fix this, you may run, e.g.,
model_1 <- lm(S ~ A + B + C + D + E + F + G + spring + summer + autumn + small + medium + low_flow + med_flow, data = trainOne)
Also see this question.
Answer 3:
Some of your variables could be perfectly collinear. Take a look at the variables and how they correlate with each other. You can start inspecting the data with cor(dataset), which returns a correlation matrix of your dataset (all columns must be numeric).
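A short sketch of this inspection on made-up dummy columns (an assumption, not the asker's data) is given below. It also illustrates a caveat: when the dependency involves several columns at once, as with a full set of season dummies, no single pairwise correlation has to reach 1 or -1, so a rank check or alias() is a useful complement to cor().

```r
# Hypothetical toy data: a balanced season variable coded as four 0/1 columns.
season  <- factor(rep(c("spring", "summer", "autumn", "winter"), times = 5))
dummies <- sapply(levels(season), function(lv) as.integer(season == lv))

# Pairwise correlation matrix of the dummy columns.
cm <- cor(dummies)
round(cm, 2)

# Largest absolute correlation between two different columns:
# clearly below 1, even though the four columns are exactly dependent.
max(abs(cm[upper.tri(cm)]))

# The dependency only shows up when all columns are considered together.
qr(cbind(1, dummies))$rank < ncol(dummies) + 1
```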
Source: https://stackoverflow.com/questions/53989003/what-is-causing-this-error-coefficients-not-defined-because-of-singularities