My question is related to this one regarding categorical data (factors in R terms) when using the caret package. I understand from the linked post that if you use the "fo
It is really the same issue as the post you link to. preProcess works only on numeric data and you have:
> str(etitanic)
'data.frame': 1046 obs. of 6 variables:
$ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ survived: int 1 1 0 0 0 1 1 0 1 0 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
$ age : num 29 0.917 2 30 25 ...
$ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
$ parch : int 0 2 2 2 2 0 0 0 0 0 ...
You can't center and scale pclass or sex as-is, so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:
> new <- model.matrix(survived ~ . - 1, data = etitanic)
> colnames(new)
[1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale" "age"
[6] "sibsp" "parch"
The -1 gets rid of the intercept. Now you can run preProcess on this object.
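A minimal sketch of that last step, assuming etitanic is loaded from the earth package and that centering and scaling are the transformations wanted. The names pp and new_scaled are only illustrative:

```r
library(caret)
data(etitanic, package = "earth")  # assumption: etitanic comes from the earth package

# All-numeric design matrix: factors expanded to dummy columns, no intercept
new <- model.matrix(survived ~ . - 1, data = etitanic)

# Estimate the centering and scaling parameters from the numeric matrix
pp <- preProcess(new, method = c("center", "scale"))

# Apply the transformation; every column is centered and scaled,
# including the 0/1 dummy columns
new_scaled <- predict(pp, new)

# Column means should be zero up to floating-point error
round(colMeans(new_scaled), 10)
```

Note that this rescales the dummy columns along with age, sibsp and parch. If the 0/1 coding should be kept, one option is to run preProcess on the continuous columns only and bind the dummy columns back afterwards.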
BTW, making preProcess ignore non-numeric data is on my "to do" list, but it might cause errors for people not paying attention.
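For the dummyVars route mentioned above, a rough equivalent might look like the following. Again this assumes etitanic comes from the earth package, and the names dv and new_dv are made up for the example:

```r
library(caret)
data(etitanic, package = "earth")  # assumption: etitanic comes from the earth package

# Build the encoder from the same formula; the outcome (survived)
# is not included in the encoded predictors
dv <- dummyVars(survived ~ ., data = etitanic)

# predict() returns a numeric matrix with the factors expanded
new_dv <- predict(dv, newdata = etitanic)

colnames(new_dv)
```

Unlike model.matrix, dummyVars by default produces one column for every level of every factor, so sex gets both a female and a male column. Passing fullRank = TRUE to dummyVars drops one level per factor if a full-rank encoding is needed.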
Max