XGBoost Categorical Variables: Dummification vs encoding

Asked by 面向向阳花 on 2021-01-30 00:19

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance/evaluation metrics between the two approaches: dummifying the categorical variables (one-hot encoding) versus encoding them as integers (label encoding)?

3 Answers
  •  佛祖请我去吃肉
    2021-01-30 00:59

    XGBoost only deals with numeric columns.

    Suppose you have a feature [a,b,b,c] which describes a categorical variable, i.e. one with no numeric relationship between its levels.

    Using LabelEncoder you will simply have this:

    array([0, 1, 1, 2])
    

    XGBoost will wrongly interpret this feature as having a numeric relationship! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more.
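
    As a minimal sketch (assuming scikit-learn is available), this is how that mapping is produced with LabelEncoder:

    from sklearn.preprocessing import LabelEncoder

    feature = ['a', 'b', 'b', 'c']                   # categorical feature, no ordering intended
    encoded = LabelEncoder().fit_transform(feature)  # maps each string to an integer code
    print(encoded)                                   # [0 1 1 2] -- looks ordinal to XGBoost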

    Proper way

    Using OneHotEncoder you will eventually get to this:

    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    

    This is the proper representation of a categorical variable for XGBoost or any other machine learning tool.
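
    As a sketch, the same matrix can be produced with scikit-learn's OneHotEncoder (assuming a scikit-learn version, 0.20 or later, that accepts string categories directly):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    feature = np.array(['a', 'b', 'b', 'c']).reshape(-1, 1)     # a single column of strings
    one_hot = OneHotEncoder().fit_transform(feature).toarray()  # dense 4x3 indicator matrix
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]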

    Pandas get_dummies is a nice tool for creating dummy variables (and, in my opinion, easier to use); a sketch follows below.
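
    A sketch of the pandas route, with a hypothetical column name 'feature':

    import pandas as pd

    df = pd.DataFrame({'feature': ['a', 'b', 'b', 'c']})
    dummies = pd.get_dummies(df['feature'], prefix='feature')
    print(dummies.columns.tolist())   # ['feature_a', 'feature_b', 'feature_c']
    # One indicator column per level; recent pandas versions return boolean dtype by default.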

    The second method in the question above (plain integer encoding) will not represent the data properly.
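
    To tie it together, a hypothetical end-to-end sketch feeding the dummified feature into an XGBoost classifier (the target column is invented for illustration):

    import pandas as pd
    import xgboost as xgb

    df = pd.DataFrame({'feature': ['a', 'b', 'b', 'c'],
                       'target':  [0, 1, 1, 0]})         # made-up binary target
    X = pd.get_dummies(df['feature'], prefix='feature')  # dummified categorical feature
    y = df['target']

    model = xgb.XGBClassifier(n_estimators=10)
    model.fit(X, y)                                      # trains on the one-hot columns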
