How to handle categorical variables in sklearn GradientBoostingClassifier?


孤城傲影

I am attempting to train models with GradientBoostingClassifier using categorical variables.

The following is a primitive code sample, just for trying to input categori


2 Answers

Happy的楠姐

2021-02-04 12:48

pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert a categorical variable into a dummy (one-hot) matrix. The dummy matrix can then be concatenated with the rest of the training data.

Below is the example code from the question with the above procedure carried out.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
from statsmodels.tools import categorical
import numpy as np

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]


###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.

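# Fake categorical variable: 'a' for the 40 class-0 rows, 'b' for the 40 class-1 rows.
catVar = np.array(['a']*40 + ['b']*40)
# drop=True returns only the dummy columns, without the original string column.
catVar = categorical(catVar, drop=True)
# Append the dummy columns to the numeric features.
X_train = np.concatenate((X_train, catVar), axis=1)

# Same encoding for the 20 test rows (10 per class).
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis=1)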
###########################################################################

# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=8, n_estimators=50)
clf.fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:,1]   # Only look at P(y==1).

fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)

print(prob)
print(y_test)
print(roc_auc_prob)

Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be passed directly to scikit-learn estimators.
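For comparison, here is a minimal sketch of the pandas.get_dummies route mentioned at the top of this answer. It is not the answerer's code: the fake categorical column `cat_var` and the `random_state` value are assumptions made for illustration, and the dummy matrix is converted to a NumPy array before it reaches the estimator. The statsmodels `categorical` helper appears to have been deprecated in later statsmodels releases, so this may be the more durable option.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
# Use only data for 2 classes, as in the example above.
mask = (iris.target == 0) | (iris.target == 1)
X, y = iris.data[mask], iris.target[mask]

# Hypothetical categorical variable with one label per row.
cat_var = pd.Series(["a", "b"] * 50)

# One dummy column per category, converted to a plain float NumPy array.
dummies = pd.get_dummies(cat_var, dtype=float).to_numpy()

# Encode once on the full data, then split, so that the training and test
# matrices are guaranteed to have the same dummy columns in the same order.
X_all = np.hstack([X, dummies])
train_idx = list(range(40)) + list(range(50, 90))
test_idx = list(range(40, 50)) + list(range(90, 100))

clf = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=8, n_estimators=50, random_state=0
)
clf.fit(X_all[train_idx], y[train_idx])

prob = clf.predict_proba(X_all[test_idx])[:, 1]  # P(y == 1) per test row
print(prob)
```

Encoding the training and test sets separately, as in the example above, only works when both happen to contain the same categories; encoding before the split avoids a column mismatch when a category is missing from one side.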


执笔经年

2021-02-04 12:50

Sure, it can handle them; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is perfectly capable of handling categorical variables, just as well as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).
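A minimal sketch of what "a separate step in the pipeline" could look like, assuming a reasonably recent scikit-learn with ColumnTransformer and OneHotEncoder available. The column name `cat`, its fake values, and the `random_state` settings are hypothetical and only there to make the example self-contained.

```python
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
mask = iris.target < 2  # keep only classes 0 and 1
X = pd.DataFrame(iris.data[mask], columns=iris.feature_names)
X["cat"] = ["a", "b"] * 50  # hypothetical categorical column
y = iris.target[mask]

# Step 1: one-hot encode the categorical column, pass the numeric ones through.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat"])],
    remainder="passthrough",
)

# Step 2: fit the classifier on the encoded matrix.
model = Pipeline([
    ("encode", preprocess),
    ("gbc", GradientBoostingClassifier(
        learning_rate=0.01, max_depth=8, n_estimators=50, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model.fit(X_train, y_train)

prob = model.predict_proba(X_test)[:, 1]  # P(y == 1) per test row
print(prob)
```

Because the encoder is fitted inside the pipeline, the categories learned from the training data are reused at prediction time, and `handle_unknown="ignore"` encodes any category unseen during training as all zeros instead of raising an error.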
