XGBoost Categorical Variables: Dummification vs encoding

Asked by 面向向阳花 on 2021-01-30 00:19

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance/evaluation metrics between the two approaches: dummifying the categorical variables (one-hot encoding) versus encoding them as integers (label encoding)?

3 Answers
  •  佛祖请我去吃肉
    2021-01-30 00:59

    XGBoost only deals with numeric columns.

    Suppose you have a feature [a,b,b,c] which describes a categorical variable, i.e. one with no numeric relationship between its levels.

    Using LabelEncoder you will simply have this:

    array([0, 1, 1, 2])
    

    XGBoost will wrongly interpret this feature as having a numeric relationship! LabelEncoder just maps each string ('a', 'b', 'c') to an integer, nothing more.
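
    As a minimal sketch (assuming scikit-learn is available), this is how that mapping is produced with LabelEncoder:

    from sklearn.preprocessing import LabelEncoder

    feature = ['a', 'b', 'b', 'c']                   # categorical feature, no ordering intended
    encoded = LabelEncoder().fit_transform(feature)  # maps each string to an integer code
    print(encoded)                                   # [0 1 1 2] -- looks ordinal to XGBoost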

    Proper way

    Using OneHotEncoder you will eventually get to this:

    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    

    This is the proper representation of a categorical variable for XGBoost or any other machine learning tool.
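
    As a sketch, the same matrix can be produced with scikit-learn's OneHotEncoder (assuming a scikit-learn version, 0.20 or later, that accepts string categories directly):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    feature = np.array(['a', 'b', 'b', 'c']).reshape(-1, 1)     # a single column of strings
    one_hot = OneHotEncoder().fit_transform(feature).toarray()  # dense 4x3 indicator matrix
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]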

    Pandas get_dummies is a nice tool for creating dummy variables (and, in my opinion, easier to use); a sketch follows below.
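
    A sketch of the pandas route, with a hypothetical column name 'feature':

    import pandas as pd

    df = pd.DataFrame({'feature': ['a', 'b', 'b', 'c']})
    dummies = pd.get_dummies(df['feature'], prefix='feature')
    print(dummies.columns.tolist())   # ['feature_a', 'feature_b', 'feature_c']
    # One indicator column per level; recent pandas versions return boolean dtype by default.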

    The second method in the question above (plain integer encoding) will not represent the data properly.
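
    To tie it together, a hypothetical end-to-end sketch feeding the dummified feature into an XGBoost classifier (the target column is invented for illustration):

    import pandas as pd
    import xgboost as xgb

    df = pd.DataFrame({'feature': ['a', 'b', 'b', 'c'],
                       'target':  [0, 1, 1, 0]})         # made-up binary target
    X = pd.get_dummies(df['feature'], prefix='feature')  # dummified categorical feature
    y = df['target']

    model = xgb.XGBClassifier(n_estimators=10)
    model.fit(X, y)                                      # trains on the one-hot columns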
