XGBoost Categorical Variables: Dummification vs encoding


面向向阳花 2021-01-30 00:19

When using XGBoost, we need to convert categorical variables into numeric values.

Would there be any difference in performance/evaluation metrics between the method

3 answers

说谎 (original poster)

2021-01-30 00:46

I want to answer this question not just for XGBoost but for any problem involving categorical data. "Dummification" (one-hot encoding) produces a very sparse setup, especially if you have several categorical columns with many levels, while label encoding is often biased because the integer codes imply an order and spacing that do not reflect the actual relationship between levels.
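To make the contrast concrete, here is a minimal, dependency-free sketch of both encodings. The `colors` column and the helper names `label_encode` and `one_hot_encode` are made up for illustration; in practice you would more likely reach for a library such as pandas or scikit-learn.

```python
def label_encode(values):
    """Map each distinct level to an integer, assigned in sorted order."""
    levels = sorted(set(values))
    mapping = {level: i for i, level in enumerate(levels)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical categorical column with three levels.
colors = ["red", "green", "blue", "green"]

# Label encoding: a single column, but the codes imply blue < green < red.
codes, mapping = label_encode(colors)

# Dummification: no implied order, but one column per level (mostly zeros).
dummies = one_hot_encode(colors)
```

With only three levels the difference is small, but a column with hundreds of levels would turn into hundreds of mostly-zero indicator columns, while the label-encoded version stays at one column with an arbitrary ordering.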

For binary classification problems, an effective yet underused approach, heavily leveraged in traditional credit scoring models, is to replace the categorical levels with their Weight of Evidence (WoE). Basically, every categorical level is replaced by the natural log of the ratio between the proportion of goods and the proportion of bads that fall in that level: WoE = ln(proportion of goods / proportion of bads).

You can read more about it here.

Python library here.

This method lets you capture all the levels in a single numeric column, avoiding both the sparsity that comes with dummifying and the arbitrary ordering that label encoding introduces.
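A minimal sketch of the Weight of Evidence replacement, using the usual log-ratio definition WoE = ln(share of goods / share of bads). The function name `weight_of_evidence`, the toy data, the convention that target 0 means "good" and 1 means "bad", and the additive `smoothing` term (added here only to avoid dividing by zero or taking the log of zero for levels with no goods or no bads) are all assumptions for illustration, not part of any specific library.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets, smoothing=0.5):
    """Return {level: WoE}. Assumes target 0 = good, 1 = bad."""
    goods = Counter(c for c, t in zip(categories, targets) if t == 0)
    bads = Counter(c for c, t in zip(categories, targets) if t == 1)
    levels = set(categories)

    # Smoothed totals so the per-level shares still sum to 1.
    total_good = sum(goods.values()) + smoothing * len(levels)
    total_bad = sum(bads.values()) + smoothing * len(levels)

    woe = {}
    for level in levels:
        pct_good = (goods[level] + smoothing) / total_good
        pct_bad = (bads[level] + smoothing) / total_bad
        woe[level] = math.log(pct_good / pct_bad)
    return woe


# Hypothetical data: level "A" is mostly good, level "B" is mostly bad.
categories = ["A", "A", "A", "B", "B", "B"]
targets = [0, 0, 1, 0, 1, 1]

woe = weight_of_evidence(categories, targets)

# Replace each level with its WoE value: one numeric column, no dummies.
encoded = [woe[c] for c in categories]
```

A positive value means the level holds a larger share of the goods than of the bads, and a negative value means the opposite. To avoid leaking the target, the mapping should be fitted on the training data only and then applied to validation and test data.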

Hope this helps!
