I have a data set with NAs sprinkled generously throughout. In addition, it has columns that need to be factors.
I am using th
Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.
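To make "up front" concrete, here is a rough sketch in base R of the kind of cleanup I mean. The data frame and its column names (age, group) are made up for illustration, and median imputation is just one possible choice, not a recommendation for your data.

```r
# Hypothetical toy data: `age` and `group` are made-up column names.
dat <- data.frame(
  age   = c(23, NA, 31, 40, NA, 28),
  group = c("a", "b", NA, "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Count missing values per column before deciding what to do with them.
na_counts <- colSums(is.na(dat))

# Convert the character column to a factor explicitly, rather than
# relying on each modelling package to do it for you.
dat$group <- factor(dat$group)

# Option 1: drop every row that contains at least one NA.
dat_complete <- dat[complete.cases(dat), ]

# Option 2: keep all rows. Fill numeric NAs with the column median
# (an assumption, pick whatever suits your data) and give missing
# factor values their own explicit level via addNA().
dat_imputed <- dat
dat_imputed$age[is.na(dat_imputed$age)] <- median(dat$age, na.rm = TRUE)
dat_imputed$group <- addNA(dat_imputed$group)
```

Either way, the modelling functions downstream see a data frame with no surprises in it, so differences in how each package treats NAs and factors matter much less.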
For the factor columns, look at model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
> dat = data.frame(x=factor(rep(1:3, each=5)))
> dat$x
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> model.matrix(~ x - 1, data=dat)
   x1 x2 x3
1   1  0  0
2   1  0  0
3   1  0  0
4   1  0  0
5   1  0  0
6   0  1  0
7   0  1  0
8   0  1  0
9   0  1  0
10  0  1  0
11  0  0  1
12  0  0  1
13  0  0  1
14  0  0  1
15  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
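The "- 1" in the formula matters. As a small sketch on the same toy data (the names mm_full and mm_ref are just mine), this compares the columns you get with and without it:

```r
dat <- data.frame(x = factor(rep(1:3, each = 5)))

# With "- 1" there is no intercept, so every level gets its own column.
mm_full <- model.matrix(~ x - 1, data = dat)

# Without "- 1", the default treatment contrasts fold the first level
# into the intercept, so only the remaining levels get dummy columns.
mm_ref <- model.matrix(~ x, data = dat)

colnames(mm_full)
colnames(mm_ref)
```

The first call gives one indicator column per level. The second gives an intercept column plus indicators for every level except the first, which is usually what you want for a linear model but not necessarily for other learners.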
Also, in case you haven't seen them (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points: http://cran.r-project.org/web/packages/caret/index.html