I want to use VotingClassifier, but I have some problems with cross-validating:

x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=...)
This error is caused by this line inside scikit-learn's VotingClassifier:
np.bincount(x, weights=self._weights_not_none)
Here, x is the array of predictions returned by the individual classifiers inside the VotingClassifier.
According to the documentation of np.bincount:
Count number of occurrences of each value in array of non-negative ints.
x : array_like, 1 dimension, nonnegative ints
So this function accepts only one-dimensional arrays of non-negative integers.
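To see this concretely, here is a minimal sketch (plain NumPy, unrelated to CatBoost) of how np.bincount behaves on integer versus float input:

import numpy as np

# Works: counts occurrences of each non-negative integer
print(np.bincount(np.array([0, 1, 1, 2])))   # [1 2 1]

# Fails: float input cannot be safely cast to int
np.bincount(np.array([0.0, 1.0, 1.0, 2.0]))
# TypeError: Cannot cast array data from dtype('float64') to dtype('int64')
# according to the rule 'safe'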
Now, your code will work if you replace the CatBoostClassifier with any other scikit-learn classifier, because scikit-learn classifiers return the class labels themselves (here, an array of np.int64) from their predict().
But CatBoostClassifier returns np.float64 as the output, and hence the error. It really should return int64 as well, because predict() is supposed to return the classes, not float values, but I don't know why it returns floats.
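If you want to verify the dtype difference yourself, a small sketch like this shows it (toy data; the exact dtypes may vary with your library versions, but this is what happens in the versions the error comes from):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(50, 3)
y = np.random.randint(0, 2, size=50)

rf = RandomForestClassifier().fit(X, y)
print(rf.predict(X).dtype)                   # int64: the class labels

cb = CatBoostClassifier(iterations=10).fit(X, y)
print(cb.predict(X).dtype)                   # float64: hence the bincount error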
You can correct this by extending the CatBoostClassifier class and converting the predictions on the fly:
import numpy as np
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_validate

class CatBoostClassifierInt(CatBoostClassifier):
    def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0, thread_count=1, verbose=None):
        predictions = self._predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose)
        # Casting the predictions to int64 is the only change
        return np.asarray(predictions, dtype=np.int64).ravel()

clf1 = CatBoostClassifierInt()
clf2 = RandomForestClassifier()
clf = VotingClassifier(estimators=[('cb', clf1), ('rf', clf2)])
cross_validate(clf, x_train, y_train, scoring='accuracy', return_train_score=True)
Now you won't get that error.
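As an aside, a way to sidestep the problem entirely (not part of the fix above, and assuming both estimators implement predict_proba, which CatBoostClassifier does) is soft voting: it averages the predicted probabilities instead of bin-counting the predicted labels, so np.bincount is never called:

# Hedged alternative: soft voting avoids np.bincount altogether
clf_soft = VotingClassifier(estimators=[('cb', CatBoostClassifier()),
                                        ('rf', RandomForestClassifier())],
                            voting='soft')
cross_validate(clf_soft, x_train, y_train, scoring='accuracy', return_train_score=True)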
A more correct version is the following. It handles all types of labels, keeps the output matching the input, and can be used with scikit-learn with ease:
from sklearn.preprocessing import LabelEncoder

class CatBoostClassifierCorrected(CatBoostClassifier):
    def fit(self, X, y=None, cat_features=None, sample_weight=None, baseline=None, use_best_model=None,
            eval_set=None, verbose=None, logging_level=None, plot=False, column_description=None, verbose_eval=None):
        # Encode arbitrary labels (strings, non-consecutive ints, ...) as 0 .. n_classes-1
        self.le_ = LabelEncoder().fit(y)
        transformed_y = self.le_.transform(y)
        self._fit(X, transformed_y, cat_features, None, sample_weight, None, None, None, baseline,
                  use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval)
        return self

    def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0, thread_count=1, verbose=None):
        predictions = self._predict(data, prediction_type, ntree_start, ntree_end, thread_count, verbose)
        # Map the integer predictions back to the original labels: the only change
        return self.le_.inverse_transform(np.asarray(predictions, dtype=np.int64).ravel())
This version will handle all the different types of labels.
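For example, a quick sanity check with string labels (hypothetical toy data) could look like this:

import numpy as np

X = np.random.rand(30, 4)
y = np.random.choice(['cat', 'dog', 'bird'], size=30)

model = CatBoostClassifierCorrected(iterations=10)
model.fit(X, y)
print(model.predict(X)[:5])   # original string labels, e.g. ['dog' 'cat' 'cat' 'bird' 'dog']

And it can be dropped into the VotingClassifier exactly like CatBoostClassifierInt above.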