Question
I am trying to run a Random Forest Classifier on an imbalanced dataset (~1:4).
I am using the method from imblearn as follows:
from imblearn.ensemble import BalancedRandomForestClassifier

rf = BalancedRandomForestClassifier(
    n_estimators=1000,
    random_state=42,
    class_weight='balanced',
    sampling_strategy='not minority',
)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The split into training and test sets is performed within a cross-validation approach using RepeatedStratifiedKFold from scikit-learn.
However, I wonder whether the test set also needs to be balanced in order to obtain sensible performance scores (sensitivity, specificity, etc.). I hope you can help me with this.
Many thanks!
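For reference, here is a minimal sketch of the setup described above, computing sensitivity and specificity on each untouched (imbalanced) test fold. It uses only scikit-learn: `make_classification` stands in for the real data, and a plain `RandomForestClassifier` with `class_weight='balanced_subsample'` stands in for imblearn's `BalancedRandomForestClassifier` so the example is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic ~1:4 imbalanced dataset as a stand-in for the real one.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
sens, spec = [], []
for train_idx, test_idx in cv.split(X, y):
    rf = RandomForestClassifier(
        n_estimators=100, class_weight='balanced_subsample', random_state=42
    )
    rf.fit(X[train_idx], y[train_idx])
    # Test fold keeps its natural class distribution; no resampling here.
    tn, fp, fn, tp = confusion_matrix(y[test_idx], rf.predict(X[test_idx])).ravel()
    sens.append(tp / (tp + fn))  # sensitivity = recall of the positive class
    spec.append(tn / (tn + fp))  # specificity = recall of the negative class

print(f"sensitivity: {np.mean(sens):.3f}  specificity: {np.mean(spec):.3f}")
```

Because sensitivity and specificity are each computed within a single class, they are unaffected by the class ratio of the test fold, which is why the test set can stay imbalanced.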
Answer 1:
From the imblearn docs:
A balanced random forest randomly under-samples each bootstrap sample to balance it.
If you are okay with random undersampling as your balancing method, then the classifier is doing that for you "under the hood". In fact, that's the point of using imblearn in the first place, to handle class imbalance. If you were using a straight random forest, like the out-of-the-box version from sklearn, then I would be more concerned about dealing with class imbalance on the front end.
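To make the "under the hood" behavior concrete, here is a sketch in plain NumPy of what under-sampling each bootstrap sample means. The helper name `balanced_bootstrap` is hypothetical; imblearn's internal implementation differs in detail.

```python
import numpy as np

rng = np.random.default_rng(42)

def balanced_bootstrap(y, rng):
    """Draw an ordinary bootstrap sample, then randomly under-sample every
    class down to the size of the smallest class in that sample.
    (Hypothetical helper illustrating the idea, not imblearn's actual code.)"""
    idx = rng.integers(0, len(y), size=len(y))  # ordinary bootstrap draw
    classes, counts = np.unique(y[idx], return_counts=True)
    n_min = counts.min()
    # Keep n_min randomly chosen examples of each class.
    keep = np.concatenate([
        rng.choice(idx[y[idx] == c], size=n_min, replace=False)
        for c in classes
    ])
    return keep

y = np.array([0] * 800 + [1] * 200)  # ~1:4 imbalance, as in the question
sample = balanced_bootstrap(y, rng)
print(np.unique(y[sample], return_counts=True))  # classes now equal-sized
```

Each tree in the balanced random forest is then trained on such a balanced sample, while the test data is never resampled.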
Source: https://stackoverflow.com/questions/54910960/random-forest-balancing-test-set