Linear model with categorical variables in R

Question

I am trying to fit a linear model with some categorical variables:

model <- lm(price ~ carat+cut+color+clarity)
summary(model)

The output is:

Call:
lm(formula = price ~ carat + cut + color + clarity)

Residuals:
     Min       1Q   Median       3Q      Max 
-11495.7   -688.5   -204.1    458.2   9305.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994    
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165    
color^5         2.867     44.586   0.064 0.948729    
color^6        50.531     40.771   1.239 0.215268    
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared:  0.9162,    Adjusted R-squared:  0.9159 
F-statistic:  2817 on 18 and 4639 DF,  p-value: < 2.2e-16

But I don't understand why the coefficients are labelled with ".L", ".Q", ".C", "^4", and so on. Something is wrong, but I don't know what. I have already tried applying the function factor to each variable.

Answer 1:

You are encountering how "ordered" (ordinal) factor variables are handled by regression functions. The default contrasts for an ordered factor are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of that factor; .L, .Q and .C are the linear, quadratic and cubic terms, and ^4, ^5, ... are the higher degrees. That result is not going to be very easy to interpret, especially if there is no natural order. Even if there is one, and there might well be in this case, you might not want the default ordering (which is alphabetical by factor level), and you probably don't want more than a few degrees in the polynomial contrasts.
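To see where the .L and .Q labels come from, here is a minimal sketch using a small made-up ordered factor. The variable `size` and its levels are hypothetical and not part of the diamonds data; the sketch assumes R's default contrast options are in effect.

```r
# Hypothetical three-level ordered factor (not from the diamonds data)
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Current contrast settings: the second entry applies to ordered factors
getOption("contrasts")

# Contrast matrix used for `size`: n - 1 = 2 columns,
# named .L (linear) and .Q (quadratic)
poly_contrasts <- contrasts(size)
print(poly_contrasts)

# For comparison, the same levels as an unordered factor get
# treatment (dummy) contrasts, one column per non-reference level
treat_contrasts <- contrasts(factor(size, ordered = FALSE))
print(treat_contrasts)
```

The column names of the contrast matrix are what lm() appends to the variable name in the coefficient table.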

In the case of ggplot2's diamonds dataset, the factor levels are set up correctly, but most newbies who stumble across ordered factors end up with orderings like "Excellent" < "Fair" < "Good" < "Poor". (Fail)

> levels(diamonds$cut)
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
> levels(diamonds$clarity)
[1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"  
> levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
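For data whose levels are not already in a sensible order, the levels can be given explicitly when the factor is created. A minimal sketch, assuming a hypothetical character vector `quality` (the names and values here are for illustration only):

```r
# Hypothetical quality ratings stored as plain character strings
quality <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Default: levels are sorted alphabetically, which is the wrong order here
wrong <- factor(quality, ordered = TRUE)
levels(wrong)

# Explicit levels, listed from worst to best
right <- factor(quality,
                levels = c("Poor", "Fair", "Good", "Excellent"),
                ordered = TRUE)
levels(right)

# Comparisons now respect the intended order
good_beats_poor <- right[1] > right[2]   # is "Good" above "Poor"?
good_beats_poor
```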

One method for using ordered factors that have been set up correctly is to just wrap them in as.numeric, which gives you a linear test of trend.

> contrasts(diamonds$cut) <- contr.treatment(5) # Treatment (dummy) contrasts instead of polynomial ones
> model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
> summary(model)

Call:
lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity), 
    data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-19130.3   -696.1   -176.8    556.9   9599.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -5189.460     36.577 -141.88   <2e-16 ***
carat                8791.452     12.659  694.46   <2e-16 ***
cut2                  909.433     35.346   25.73   <2e-16 ***
cut3                 1129.518     32.772   34.47   <2e-16 ***
cut4                 1156.989     32.427   35.68   <2e-16 ***
cut5                 1264.128     32.160   39.31   <2e-16 ***
as.numeric(color)    -318.518      3.282  -97.05   <2e-16 ***
as.numeric(clarity)   522.198      3.521  148.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1227 on 53932 degrees of freedom
Multiple R-squared:  0.9054,    Adjusted R-squared:  0.9054 
F-statistic: 7.371e+04 on 7 and 53932 DF,  p-value: < 2.2e-16
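To make the as.numeric coding concrete, here is a small sketch of what it does to an ordered factor. The vector `clarity_demo` is hypothetical; its levels are copied from levels(diamonds$clarity) above, and the slope is the as.numeric(clarity) estimate from the summary.

```r
# Hypothetical ordered factor using the clarity levels listed above
clarity_demo <- factor(c("SI1", "IF", "I1", "VS2"),
                       levels = c("I1", "SI2", "SI1", "VS2",
                                  "VS1", "VVS2", "VVS1", "IF"),
                       ordered = TRUE)

# as.numeric returns each value's position in the level order
codes <- as.numeric(clarity_demo)
codes

# The model treats every one-step move up the scale as the same price change,
# so going from the lowest to the highest level is 7 steps
slope <- 522.198                      # as.numeric(clarity) estimate from the summary
full_range_change <- (8 - 1) * slope
full_range_change
```

Note that this coding assumes the levels are equally spaced, which is a modelling choice rather than something the data guarantee.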

Answer 2:

Since @Roland didn't post what he thought would be a better approach (and I kind of agreed with him), I needed to educate myself on how a real statistician would do this in R. I eventually found the right coding advice on SO in a post by @SvenHohenstein: "How to properly set contrasts in R".

The reason I like to use as.numeric is that I know how to interpret the coefficients. The coefficient is the 'effect' (remembering that the word 'effect' does not imply causation) of a one-level difference in the predictor on the outcome (the LHS or dependent variable). Looking at my first answer, which at the moment is above this one, you see values around 1000 for the coefficients of cut2 through cut5 and no value for cut1. The contribution of the first level of cut ("Fair") is buried inside the '(Intercept)'. The estimates look like this (model.cut is the fit from that answer):

> cbind( levels(diamonds$cut), c(coef(model.cut)[grep('Intercept|cut', names(coef(model.cut)))] ))
            [,1]        [,2]               
(Intercept) "Fair"      "-5189.46034442502"
cut2        "Good"      "909.432743872746" 
cut3        "Very Good" "1129.51839934007" 
cut4        "Premium"   "1156.98898349819" 
cut5        "Ideal"     "1264.12800574865"

You could plot the unadjusted means, but the unadjusted values don't really make sense (thus emphasizing the need for regression analyses):

> with(diamonds, tapply(price, cut, mean))
     Fair      Good Very Good   Premium     Ideal 
 4358.758  3928.864  3981.760  4584.258  3457.542

So look at the effect of cut within quintiles of carat (cut2 is from the Hmisc package):

> with(diamonds, tapply(price, list(cut, cut2(carat, g=5) ), mean))
          [0.20,0.36) [0.36,0.54) [0.54,0.91) [0.91,1.14) [1.14,5.01]
Fair         802.4528    1193.162    2336.543    4001.972    8682.351
Good         574.7482    1101.406    2701.412    4872.072    9788.294
Very Good    597.9258    1151.537    2727.251    5464.223   10158.057
Premium      717.1096    1149.550    2537.446    5214.787   10131.999
Ideal        739.8972    1254.229    2624.180    6050.358   10317.725

So an effect of ... what? Maybe an average of 800 across the full range of values of 'cut' for a two-way analysis? Keeping only the linear contrast for cut gives:

> contrasts(diamonds$cut, how.many=1) <- poly(1:5)
> model.cut2 <- lm(price ~ carat+cut, diamonds)
> model.cut2

Call:
lm(formula = price ~ carat + cut, data = diamonds)

Coefficients:
(Intercept)        carat         cut1  
    -2555.1       7838.5        815.8  

> contrasts(diamonds$cut)
                   1
Fair      -0.6324555
Good      -0.3162278
Very Good  0.0000000
Premium    0.3162278
Ideal      0.6324555

The average difference in estimated price, holding carat constant, for Fair versus Ideal would be (-0.6324555 - 0.6324555) * 815.8, or a price difference of minus 1031.91 (dollars or euros, whatever the units of the price variable are).
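The same arithmetic can be written out in R. This is only a sketch that retypes the contrast values and the cut1 coefficient printed above; the object names are made up for illustration.

```r
# Linear contrast values for the five levels of cut, as printed above
lin <- c(Fair = -0.6324555, Good = -0.3162278, `Very Good` = 0,
         Premium = 0.3162278, Ideal = 0.6324555)

# Estimated coefficient for the single linear contrast (cut1)
b_cut <- 815.8

# Contribution of each level to the predicted price, holding carat constant
contribution <- lin * b_cut
round(contribution, 2)

# Fair minus Ideal
diff_fair_ideal <- contribution[["Fair"]] - contribution[["Ideal"]]
round(diff_fair_ideal, 2)
```

Any other pair of levels can be compared the same way by subtracting the corresponding entries of `contribution`.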

I've decided to remove a bunch of other stuff I was going to put in here, because I think this adequately demonstrates my essential point: one needs to understand the underlying coding in order to properly interpret and communicate the magnitude of "effects". The coefficients alone are not meaningful. The linear contrasts from poly create an effect coefficient that is essentially for a "full" range difference. One needs to do comparisons using both the contrast matrix values and the estimated coefficients if using R's poly(). The range of the contrasts is typically around 1, and linear contrasts are centered on 0.

Answer 3:

A reasonable a priori approach that conserves power here would be to evaluate only the linear, quadratic, and cubic contrasts. That allows most plausible models and avoids testing the higher-order polynomials that the large number of levels permits, but which would make William of Ockham unwell if relied upon in theory :-)

library(ggplot2)
df = diamonds[1:1000, ] # a chunk of data
contrasts(df$cut    , how.many=3) = contr.poly(nlevels(df$cut))
contrasts(df$color  , how.many=3) = contr.poly(nlevels(df$color))
contrasts(df$clarity, how.many=3) = contr.poly(nlevels(df$clarity))
model <- lm(price ~ carat+cut+color+clarity, data = df)
summary(model)


Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -692.74      30.99 -22.353  < 2e-16 ***
carat        4444.79      41.37 107.431  < 2e-16 ***
cut.L         286.35      22.31  12.835  < 2e-16 ***
cut.Q         -88.61      20.26  -4.374 1.35e-05 ***
cut.C         120.91      18.51   6.532 1.03e-10 ***
color.L      -660.17      24.93 -26.476  < 2e-16 ***
color.Q      -119.34      23.90  -4.993 7.03e-07 ***
color.C        37.18      20.90   1.779   0.0756 .  
clarity.L    1356.12      43.22  31.375  < 2e-16 ***
clarity.Q    -220.86      33.48  -6.596 6.87e-11 ***
clarity.C     375.47      31.10  12.073  < 2e-16 ***

Multiple R-squared:  0.929, Adjusted R-squared:  0.9283 
F-statistic:  1293 on 10 and 989 DF,  p-value: < 2.2e-16
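To check what how.many does before fitting, one can inspect the contrast matrix directly. A minimal sketch on a stand-alone factor (`cut_demo` is a hypothetical object that reuses the cut levels shown earlier, so ggplot2 is not needed):

```r
# Hypothetical ordered factor with the five cut levels
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_demo <- factor(cut_levels, levels = cut_levels, ordered = TRUE)

# Default polynomial contrasts: n - 1 = 4 columns
default_cols <- ncol(contrasts(cut_demo))
default_cols

# Keep only the first three polynomial contrasts
contrasts(cut_demo, how.many = 3) <- contr.poly(nlevels(cut_demo))
kept <- contrasts(cut_demo)
colnames(kept)
```

The remaining columns are the linear, quadratic and cubic terms, which is why only those three coefficients per factor appear in the summary.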

Source: https://stackoverflow.com/questions/30159162/linear-model-with-categorical-variables-in-r

Tags

factors