Question
I am trying to fit a linear model with some categorical variables:
model <- lm(price ~ carat+cut+color+clarity)
summary(model)
The output is:
Call:
lm(formula = price ~ carat + cut + color + clarity)
Residuals:
     Min       1Q   Median       3Q      Max
-11495.7   -688.5   -204.1    458.2   9305.3
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165
color^5         2.867     44.586   0.064 0.948729
color^6        50.531     40.771   1.239 0.215268
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared: 0.9162, Adjusted R-squared: 0.9159
F-statistic: 2817 on 18 and 4639 DF, p-value: < 2.2e-16
But I don't understand why the coefficients are labelled ".L", ".Q", ".C", "^4", and so on. Something seems wrong, but I don't know what. I have already tried applying the function factor to each variable.
Answer 1:
You are seeing how “ordered” (ordinal) factor variables are handled by regression functions. Their default contrasts are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of the factor; the suffixes .L, .Q, .C, ^4, ... label the linear, quadratic, cubic, and higher-degree terms. That result is not easy to interpret, especially if the levels have no natural order. Even if they do, as may well be the case here, you might not want the default ordering (which is alphabetical by factor level), and you probably don't want more than a few degrees in the polynomial contrasts.
In ggplot2's diamonds dataset the factor levels are set up correctly, as the levels printed here show, but most newcomers who stumble across ordered factors end up with alphabetically ordered levels such as "Excellent" < "Fair" < "Good" < "Poor", which is wrong.
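The difference between the two kinds of factor can be seen in a minimal sketch using base R only. The three level names are made up for illustration, and the comments assume R's default contrasts option has not been changed:

```r
# Unordered and ordered factors get different default contrasts.
x_levels <- c("low", "mid", "high")   # hypothetical levels
f_unord  <- factor(x_levels, levels = x_levels)
f_ord    <- factor(x_levels, levels = x_levels, ordered = TRUE)

# Shows which contrast function is used for each kind of factor
getOption("contrasts")

contrasts(f_unord)   # treatment (dummy) coding, columns "mid" and "high"
contrasts(f_ord)     # polynomial coding, columns ".L" and ".Q"

# With 5 levels there are n - 1 = 4 polynomial columns
poly5 <- contr.poly(5)
colnames(poly5)      # ".L" ".Q" ".C" "^4"
```

These column names are what lm() appends to the variable name, which is where labels such as cut.L and cut^4 come from.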
> levels(diamonds$cut)
[1] "Fair" "Good" "Very Good" "Premium" "Ideal"
> levels(diamonds$clarity)
[1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
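For data where the order has not been set up yet, the alphabetical trap can be avoided by passing the levels explicitly. This is only a sketch: the rating values below are invented and do not come from the diamonds data.

```r
# Hypothetical quality ratings; the values are illustrative only.
rating <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Without 'levels', the order is alphabetical, which is wrong here.
bad <- factor(rating, ordered = TRUE)
levels(bad)        # "Excellent" "Fair" "Good" "Poor"

# Spell out the intended order from worst to best.
good <- factor(rating,
               levels  = c("Poor", "Fair", "Good", "Excellent"),
               ordered = TRUE)
levels(good)       # "Poor" "Fair" "Good" "Excellent"
as.integer(good)   # 3 1 4 2 3
```

The integer codes in the last line are what as.numeric would feed into a model, so they are only meaningful once the level order is right.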
One method for using ordered factors, once they have been set up correctly, is to wrap them in as.numeric, which gives you a linear test of trend.
> contrasts(diamonds$cut) <- contr.treatment(5) # Use treatment (dummy) contrasts for cut instead of polynomial ones
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)
Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity),
data = diamonds)
Residuals:
     Min       1Q   Median       3Q      Max
-19130.3   -696.1   -176.8    556.9   9599.8
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -5189.460     36.577 -141.88   <2e-16 ***
carat                8791.452     12.659  694.46   <2e-16 ***
cut2                  909.433     35.346   25.73   <2e-16 ***
cut3                 1129.518     32.772   34.47   <2e-16 ***
cut4                 1156.989     32.427   35.68   <2e-16 ***
cut5                 1264.128     32.160   39.31   <2e-16 ***
as.numeric(color)    -318.518      3.282  -97.05   <2e-16 ***
as.numeric(clarity)   522.198      3.521  148.31   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared: 0.9054, Adjusted R-squared: 0.9054
F-statistic: 7.371e+04 on 7 and 53932 DF, p-value: < 2.2e-16
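If separate per-level coefficients are preferred over a single linear trend, another option is to drop the ordering with factor(x, ordered = FALSE), so that lm() falls back to treatment (dummy) coding. The sketch below uses simulated data rather than diamonds; the names toy, grade, size, and value are invented for illustration.

```r
set.seed(1)
# Simulated stand-in data; all names and numbers are illustrative.
lv  <- c("Fair", "Good", "Ideal")
toy <- data.frame(
  grade = factor(sample(lv, 200, replace = TRUE), levels = lv, ordered = TRUE),
  size  = runif(200, 0.2, 2)
)
toy$value <- 1000 * toy$size + 100 * as.integer(toy$grade) + rnorm(200, sd = 50)

# Ordered factor: polynomial terms grade.L and grade.Q
fit_o <- lm(value ~ size + grade, data = toy)
names(coef(fit_o))

# Unordered copy: one coefficient per non-reference level,
# each a difference from the first level ("Fair")
toy$grade_u <- factor(toy$grade, ordered = FALSE)
fit_u <- lm(value ~ size + grade_u, data = toy)
names(coef(fit_u))
```

Both fits describe the same three group means, so the fitted values agree; only the parameterisation, and therefore the meaning of each coefficient, changes.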
Answer 2:
Since @Roland didn't post what he thought would be a better approach (and I kind of agreed with him), I needed to educate myself on how a real statistician would do this in R. I eventually found the right coding advice on SO in a post by @SvenHohenstein: "How to properly set contrasts in R". The reason I like to use as.numeric is that I know how to interpret the coefficients. The coefficient is the 'effect' (remembering that the word 'effect' does not imply causation) of a one-level difference in the factor on the LHS outcome, i.e. the dependent variable. Looking at my first answer, which at the moment is above this one, you see values around 1000 for the coefficients of cut2 to cut5 and no value for cut1. The contribution of the "value" for cut==1 is buried inside the '(Intercept)'. Here model.cut is the model fitted in that answer (where it was called model), and the estimates look like:
> cbind( levels(diamonds$cut), c(coef(model.cut)[grep('Intercept|cut', names(coef(model.cut)))] ))
            [,1]        [,2]
(Intercept) "Fair"      "-5189.46034442502"
cut2        "Good"      "909.432743872746"
cut3        "Very Good" "1129.51839934007"
cut4        "Premium"   "1156.98898349819"
cut5        "Ideal"     "1264.12800574865"
You could plot the unadjusted means, but the unadjusted values don't really make sense (which emphasizes the need for regression analysis):
> with(diamonds, tapply(price, cut, mean))
     Fair      Good Very Good   Premium     Ideal
 4358.758  3928.864  3981.760  4584.258  3457.542
So look at the effect of cut within quintiles of carat (cut2 is from the Hmisc package):
> with(diamonds, tapply(price, list(cut, cut2(carat, g=5) ), mean))
          [0.20,0.36) [0.36,0.54) [0.54,0.91) [0.91,1.14) [1.14,5.01]
Fair         802.4528    1193.162    2336.543    4001.972    8682.351
Good         574.7482    1101.406    2701.412    4872.072    9788.294
Very Good    597.9258    1151.537    2727.251    5464.223   10158.057
Premium      717.1096    1149.550    2537.446    5214.787   10131.999
Ideal        739.8972    1254.229    2624.180    6050.358   10317.725
So an effect of ... what? Maybe an average of 800 across the full range of values of 'cut' in a two-way analysis?
> contrasts(diamonds$cut, how.many=1) <- poly(1:5)
> model.cut2 <- lm(price ~ carat+cut, diamonds)
> model.cut2
Call:
lm(formula = price ~ carat + cut, data = diamonds)
Coefficients:
(Intercept)        carat         cut1
    -2555.1       7838.5        815.8
> contrasts(diamonds$cut)
                   1
Fair      -0.6324555
Good      -0.3162278
Very Good  0.0000000
Premium    0.3162278
Ideal      0.6324555
The average difference in estimated price, holding carat constant, for Fair versus Ideal would be (-0.6324555 - 0.6324555) * 815.8, or a price difference of minus 1031.91 (dollars or euros... whatever the units of the price variable are).
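That arithmetic can be reproduced in a few lines. The sketch takes the coefficient 815.8 from the printed cut1 estimate and uses contr.poly(5) for the linear column, which has the same values as the contrast matrix shown above:

```r
# Linear contrast values for a 5-level factor
lin <- contr.poly(5)[, 1]
names(lin) <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

b_cut <- 815.8   # the 'cut1' coefficient from the printed model

# Contribution of each level to the fitted price
round(lin * b_cut, 2)

# Fair minus Ideal, holding carat constant
diff_fair_ideal <- unname((lin["Fair"] - lin["Ideal"]) * b_cut)
round(diff_fair_ideal, 2)   # -1031.91
```

The same pattern, a difference of contrast values times the coefficient, gives the estimated difference between any other pair of levels.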
I've decided to remove a bunch of other stuff I was going to put in here, because I think this adequately demonstrates my essential point: one needs to understand the underlying coding in order to properly interpret and communicate the magnitude of "effects". The coefficients alone are not meaningful. The linear contrasts from poly create an effect coefficient that is essentially for a "full"-range difference. When using R's poly(), one needs to do comparisons using both the contrast matrix values and the estimated coefficients. The range of the contrast values is typically around 1, and linear contrasts are centered on 0.
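One way to check such a hand calculation is to compare it with predict(). The sketch below does this on simulated data, so the names toy, g, x, and y and all numbers are invented. It uses contr.poly(5) with how.many = 1, which keeps only the linear column.

```r
set.seed(42)
lv  <- c("A", "B", "C", "D", "E")   # hypothetical ordered levels
toy <- data.frame(g = factor(sample(lv, 500, replace = TRUE),
                             levels = lv, ordered = TRUE),
                  x = runif(500))
toy$y <- 2 * toy$x + 0.5 * as.integer(toy$g) + rnorm(500, sd = 0.1)

# Keep only the linear contrast for g
contrasts(toy$g, how.many = 1) <- contr.poly(5)
fit <- lm(y ~ x + g, data = toy)

cm <- contrasts(toy$g)        # 5 x 1 contrast matrix
b  <- coef(fit)[["g.L"]]      # linear-trend coefficient

# Difference between levels A and E: contrast difference times coefficient
by_hand <- as.numeric((cm["A", 1] - cm["E", 1]) * b)

# Same difference from predict(), holding x fixed at 0.5.
# Rows are taken from toy so the factor keeps its contrast attribute.
nd   <- toy[c(match("A", toy$g), match("E", toy$g)), ]
nd$x <- 0.5
p    <- predict(fit, newdata = nd)
by_predict <- as.numeric(p[1] - p[2])

all.equal(by_hand, by_predict)   # TRUE
```

Because x is held fixed, the intercept and the x term cancel in the prediction difference, leaving only the contrast term.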
Answer 3:
A reasonable a priori approach that conserves power here would be to evaluate only the linear, quadratic, and cubic contrasts. That allows most plausible models and avoids testing the higher-order polynomials that the large number of levels permits, but which would make William of Ockham unwell if relied upon in theory :-)
library(ggplot2)
df = diamonds[1:1000, ] # a chunk of data
contrasts(df$cut , how.many=3) = contr.poly(nlevels(df$cut))
contrasts(df$color , how.many=3) = contr.poly(nlevels(df$color))
contrasts(df$clarity, how.many=3) = contr.poly(nlevels(df$clarity))
model <- lm(price ~ carat+cut+color+clarity, data = df)
summary(model)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -692.74      30.99 -22.353  < 2e-16 ***
carat        4444.79      41.37 107.431  < 2e-16 ***
cut.L         286.35      22.31  12.835  < 2e-16 ***
cut.Q         -88.61      20.26  -4.374 1.35e-05 ***
cut.C         120.91      18.51   6.532 1.03e-10 ***
color.L      -660.17      24.93 -26.476  < 2e-16 ***
color.Q      -119.34      23.90  -4.993 7.03e-07 ***
color.C        37.18      20.90   1.779   0.0756 .
clarity.L    1356.12      43.22  31.375  < 2e-16 ***
clarity.Q    -220.86      33.48  -6.596 6.87e-11 ***
clarity.C     375.47      31.10  12.073  < 2e-16 ***
Multiple R-squared: 0.929, Adjusted R-squared: 0.9283
F-statistic: 1293 on 10 and 989 DF, p-value: < 2.2e-16
Source: https://stackoverflow.com/questions/30159162/linear-model-with-categorical-variables-in-r