How to preProcess features when some of them are factors?

有刺的猬 2021-01-04 05:05

My question is related to this one regarding categorical data (factors in R terms) when using the caret package. I understand from the linked post that if you use the "fo

2 answers
  • 2021-01-04 05:57

    Here's a quick way to exclude factors or whatever you'd like from consideration:

    library(caret)  # provides preProcess
    set.seed(1)
    N <- 20
    dat <- data.frame( 
        x = factor(sample(LETTERS[1:5],N,replace=TRUE)),
        y = rnorm(N,5,12),
        z = rnorm(N,-5,17) + runif(N,2,12)
    )
    
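    # Wraps preProcess so that columns of the excluded classes (factors by
    # default) are left untouched. Extra arguments are passed to preProcess.
    ppWrapper <- function( x, excludeClasses=c("factor"), ... ) {
        # TRUE for every column that belongs to one of the excluded classes
        whichToExclude <- sapply( x, function(y) any(sapply(excludeClasses, function(excludeClass) is(y,excludeClass) )) )
        # Estimate the transformation on the remaining columns and apply it to them
        processedMat <- predict( preProcess( x[!whichToExclude], ...), newdata=x[!whichToExclude] )
        x[!whichToExclude] <- processedMat
        x
    }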
    
    > ppWrapper(dat)
       x          y           z
    1  C  1.6173595 -0.44054795
    2  A -0.2933705 -1.98856921
    3  C  1.2177384  0.65420288
    4  D -0.8710374  0.62409408
    5  D -0.4504202 -0.34048640
    6  D -0.6943283  0.24236671
    7  E  0.7778192  0.91606677
    8  D  0.2184563 -0.44935163
    9  C -0.3611408  0.26075970
    10 B -0.7066441 -0.23046073
    11 D -1.5154339 -0.75549761
    12 D  0.4504825  0.38552988
    13 B  1.5692675  0.04093040
    14 C  0.4127541  0.13161807
    15 D  0.5426321  1.09527418
    16 B -2.1040322 -0.04544407
    17 C  0.6928574  1.12090541
    18 B  0.3580960  1.91446230
    19 E  0.3619967 -0.89018040
    20 A -1.2230522 -2.24567237
    

    Any extra arguments you give to ppWrapper (for example method) are passed along to preProcess.
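As a rough illustration of that argument forwarding, the sketch below calls the wrapper with method = "range" instead of the default centering and scaling. It assumes caret is installed. The wrapper and data are repeated so that the snippet runs on its own, and the name `scaled` is made up for the example.

```r
library(caret)

set.seed(1)
N <- 20
dat <- data.frame(
    x = factor(sample(LETTERS[1:5], N, replace = TRUE)),
    y = rnorm(N, 5, 12),
    z = rnorm(N, -5, 17) + runif(N, 2, 12)
)

# Same wrapper as above, repeated so this snippet is self-contained
ppWrapper <- function(x, excludeClasses = c("factor"), ...) {
    whichToExclude <- sapply(x, function(y) any(sapply(excludeClasses, function(cl) is(y, cl))))
    processedMat <- predict(preProcess(x[!whichToExclude], ...), newdata = x[!whichToExclude])
    x[!whichToExclude] <- processedMat
    x
}

# method = "range" is forwarded to preProcess through `...`;
# it rescales each numeric column to the [0, 1] interval
scaled <- ppWrapper(dat, method = "range")

str(scaled)      # x is still a factor; y and z are rescaled
range(scaled$y)  # smallest and largest value after rescaling
```

The factor column is returned unchanged, so the result keeps the same column order and types as the input.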

  • 2021-01-04 06:06

    It is really the same issue as the post you link to. preProcess works only on numeric data and you have:

    > str(etitanic)
    'data.frame':   1046 obs. of  6 variables:
     $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
     $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
     $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
     $ age     : num  29 0.917 2 30 25 ...
     $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
     $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
    

    You can't center and scale pclass or sex as-is so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:

     > new <- model.matrix(survived ~ . - 1, data = etitanic)
     > colnames(new)
     [1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale"   "age"      
     [6] "sibsp"     "parch"  
    

    The -1 gets rid of the intercept. Now you can run preProcess on this object.
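A minimal sketch of that last step, assuming caret is installed. To keep it self-contained it uses a small made-up data frame (`toy`) in place of etitanic; the names `toyMat`, `pp` and `toyProc` are illustrative only.

```r
library(caret)

# Hypothetical toy data standing in for etitanic
toy <- data.frame(
    pclass = factor(c("1st", "2nd", "3rd", "1st", "3rd", "2nd")),
    sex    = factor(c("female", "male", "male", "female", "male", "female")),
    age    = c(29, 40, 2, 30, 25, 51)
)

# Expand the factors into dummy columns; "- 1" drops the intercept
toyMat <- model.matrix(~ . - 1, data = toy)
colnames(toyMat)

# Estimate centering and scaling on the all-numeric matrix, then apply them
pp      <- preProcess(toyMat, method = c("center", "scale"))
toyProc <- predict(pp, newdata = toyMat)

head(toyProc)
```

Keeping the `pp` object around matters in practice: the same estimated transformation can then be applied to new data with predict(pp, newdata = ...), provided the new data is expanded into the same dummy columns first.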

    btw making preProcess ignore non-numeric data is on my "to do" list but it might cause errors for people not paying attention.

    Max
