R caret / rfe variable selection for factors() AND NAs

我在风中等你 2021-01-06 11:47

I have a data set with NAs sprinkled generously throughout.

In addition, it has columns that need to be factors.

I am using th

1 answer
  • 2021-01-06 12:08

    Because of inconsistent behavior on these points between packages, not to mention the extra trickiness when going to more "meta" packages like caret, I always find it easier to deal with NAs and factor variables up front, before I do any machine learning.

    • For NAs, either omit the affected rows or impute the missing values (median, kNN, etc.).
    • For factor features, you were on the right track with model.matrix(). It will let you generate a series of "dummy" features for the different levels of the factor. The typical usage is something like this:
    > dat = data.frame(x=factor(rep(1:3, each=5)))
    > dat$x
     [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
    Levels: 1 2 3
    > model.matrix(~ x - 1, data=dat)
       x1 x2 x3
    1   1  0  0
    2   1  0  0
    3   1  0  0
    4   1  0  0
    5   1  0  0
    6   0  1  0
    7   0  1  0
    8   0  1  0
    9   0  1  0
    10  0  1  0
    11  0  0  1
    12  0  0  1
    13  0  0  1
    14  0  0  1
    15  0  0  1
    attr(,"assign")
    [1] 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$x
    [1] "contr.treatment"
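
The two steps above can be combined in base R. What follows is only a minimal sketch on made-up data: the data frame, the column names `num` and `grp`, and the "missing" level label are all assumptions for illustration, and median imputation is just one of the options mentioned. It also shows why the order matters: under R's default `na.action` setting, `model.matrix()` drops any row that contains an NA.

```r
# Toy data (hypothetical): one numeric and one factor column, both with NAs.
raw <- data.frame(
  num = c(1.5, NA, 3.0, 4.5, NA, 6.0),
  grp = factor(c("a", "b", NA, "a", "c", "b"))
)

# Pitfall: with the default na.action ("na.omit"), rows containing NA
# are dropped when the design matrix is built.
x_raw <- model.matrix(~ . - 1, data = raw)

dat <- raw

# Numeric column: replace NAs with the median of the observed values.
dat$num[is.na(dat$num)] <- median(dat$num, na.rm = TRUE)

# Factor column: one option is to make missingness its own level.
dat$grp <- addNA(dat$grp)
levels(dat$grp)[is.na(levels(dat$grp))] <- "missing"

# Expand to dummy columns. "- 1" removes the intercept, so every level
# of the factor gets its own indicator column.
x <- model.matrix(~ . - 1, data = dat)

nrow(x_raw)  # fewer rows than the input
nrow(x)      # same number of rows as the input
colnames(x)
```

The resulting `x` is a plain numeric matrix with no NAs, which is the form most modelling functions accept without further special handling. Whether a separate "missing" level is appropriate depends on the data; imputing the most frequent level is another possibility.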
    

    Also, in case you haven't seen them (although it sounds like you have), the caret vignettes on CRAN are very nice and touch on some of these points. http://cran.r-project.org/web/packages/caret/index.html
