How to preProcess features when some of them are factors?

有刺的猬 2021-01-04 05:05

My question is related to this one regarding categorical data (factors in R terms) when using the Caret package. I understand from the linked post that if you use the "fo

2 answers
  •  有刺的猬
    2021-01-04 06:06

    It is really the same issue as the post you link to. preProcess works only on numeric data and you have:

    > str(etitanic)
    'data.frame':   1046 obs. of  6 variables:
     $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
     $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
     $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
     $ age     : num  29 0.917 2 30 25 ...
     $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
     $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
    

    You can't center and scale pclass or sex as-is so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:

     > new <- model.matrix(survived ~ . - 1, data = etitanic)
     > colnames(new)
     [1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale"   "age"      
     [6] "sibsp"     "parch"  
    

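    The -1 gets rid of the intercept (which is also why pclass keeps all three of its levels as columns, while sex gets only the single sexmale column). Now you can run preProcess on this object.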
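
    As a rough sketch of that last step, the following runs preProcess on the model.matrix output and also shows the dummyVars route mentioned above. It assumes etitanic is loaded from the earth package and that centering and scaling are the only transformations wanted; the names pp, dv, dummies and the *_scaled objects are made up for illustration.

```r
library(caret)  # preProcess(), dummyVars()
library(earth)  # assumption: etitanic is taken from the earth package

data(etitanic)

# Route 1: model.matrix, as above; "- 1" drops the intercept column.
new <- model.matrix(survived ~ . - 1, data = etitanic)

# Estimate the centering and scaling values, then apply them with predict().
pp <- preProcess(new, method = c("center", "scale"))
new_scaled <- predict(pp, new)

# Route 2: dummyVars stores the encoding in an object that can be reused.
dv <- dummyVars(survived ~ ., data = etitanic)
dummies <- predict(dv, newdata = etitanic)
pp_dv <- preProcess(dummies, method = c("center", "scale"))
dummies_scaled <- predict(pp_dv, dummies)

# Both results are all-numeric: one column per dummy or numeric predictor.
str(new_scaled)
str(dummies_scaled)
```

    By default dummyVars keeps a column for every factor level, so sex contributes two columns here instead of one; passing fullRank = TRUE drops one level per factor if that redundancy matters for the model. Keeping pp (or pp_dv) around lets the same centering and scaling values be applied to new data later with predict().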

    By the way, making preProcess ignore non-numeric data is on my "to do" list, but it might cause errors for people not paying attention.

    Max
