Can sklearn random forest directly handle categorical features?

予麋鹿 2020-12-04 11:35

Say I have a categorical feature, color, which takes the values

['red', 'blue', 'green', 'orange'],

and I want to use it to predict something in a ra

3 Answers
  • 2020-12-04 12:04

    Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.

    A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.

    This article by Will McGinnis has a very good discussion of one-hot encoding and alternatives.

    This article by Nick Dingwall and Chris Potts has a very good discussion of categorical variables and tree-based learners.

  • 2020-12-04 12:04

    You have to convert the categorical variable into a series of dummy variables. Yes, I know it's annoying and seems unnecessary, but that is how sklearn works. If you are using pandas, use pd.get_dummies; it works really well.

  • 2020-12-04 12:09

    No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
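The dummy-variable approach suggested in the answers above can be sketched roughly as follows. This is only a minimal illustration, not a recommended pipeline: the toy data, the `size` and `label` columns, and the forest settings are all made up for the example. It shows `pd.get_dummies` expanding `color` into one indicator column per category before the data is passed to `RandomForestClassifier`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: the "size" and "label" columns are hypothetical, for illustration only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "orange", "red", "blue"],
    "size": [1.0, 2.5, 3.0, 0.5, 1.2, 2.7],
    "label": [0, 1, 1, 0, 0, 1],
})

# Expand the categorical column into one indicator column per category.
X = pd.get_dummies(df[["color", "size"]], columns=["color"])
y = df["label"]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# New rows must end up with exactly the same columns as the training data;
# categories absent from the new rows are filled with 0.
new = pd.DataFrame({"color": ["green"], "size": [2.9]})
X_new = pd.get_dummies(new, columns=["color"]).reindex(columns=X.columns, fill_value=0)
pred = clf.predict(X_new)
```

One thing to watch with this approach is that `pd.get_dummies` only creates columns for the categories it sees, which is why the new rows are reindexed against the training columns before predicting.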
