Question
I am trying to run a Random Forest Classifier on an imbalanced dataset (~1:4).
I am using the method from imblearn as follows:
from imblearn.ensemble import BalancedRandomForestClassifier

rf = BalancedRandomForestClassifier(
    n_estimators=1000,
    random_state=42,
    class_weight='balanced',
    sampling_strategy='not minority',
)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The split into training and test sets is performed within a cross-validation approach using RepeatedStratifiedKFold from scikit-learn.
However, I wonder whether the test set also needs to be balanced in order to obtain sensible performance scores (sensitivity, specificity, etc.). I hope you can help me with this.
Many thanks!
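For reference, here is a minimal sketch of the setup described above, computing sensitivity and specificity on each untouched (imbalanced) test fold. It uses only scikit-learn: `make_classification` stands in for the real data, and a plain `RandomForestClassifier` with `class_weight='balanced_subsample'` stands in for imblearn's `BalancedRandomForestClassifier` so the example is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic ~1:4 imbalanced dataset as a stand-in for the real one.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
sens, spec = [], []
for train_idx, test_idx in cv.split(X, y):
    rf = RandomForestClassifier(
        n_estimators=100, class_weight='balanced_subsample', random_state=42
    )
    rf.fit(X[train_idx], y[train_idx])
    # Test fold keeps its natural class distribution; no resampling here.
    tn, fp, fn, tp = confusion_matrix(y[test_idx], rf.predict(X[test_idx])).ravel()
    sens.append(tp / (tp + fn))  # sensitivity = recall of the positive class
    spec.append(tn / (tn + fp))  # specificity = recall of the negative class

print(f"sensitivity: {np.mean(sens):.3f}  specificity: {np.mean(spec):.3f}")
```

Because sensitivity and specificity are each computed within a single class, they are unaffected by the class ratio of the test fold, which is why the test set can stay imbalanced.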
Answer 1:
From the imblearn docs:
A balanced random forest randomly under-samples each bootstrap sample to balance it.
If you are okay with random undersampling as your balancing method, then the classifier is doing that for you "under the hood". In fact, that's the point of using imblearn in the first place, to handle class imbalance. If you were using a straight random forest, like the out-of-the-box version from sklearn, then I would be more concerned about dealing with class imbalance on the front end.
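To make the "under the hood" behavior concrete, here is a sketch in plain NumPy of what under-sampling each bootstrap sample means. The helper name `balanced_bootstrap` is hypothetical; imblearn's internal implementation differs in detail.

```python
import numpy as np

rng = np.random.default_rng(42)

def balanced_bootstrap(y, rng):
    """Draw an ordinary bootstrap sample, then randomly under-sample every
    class down to the size of the smallest class in that sample.
    (Hypothetical helper illustrating the idea, not imblearn's actual code.)"""
    idx = rng.integers(0, len(y), size=len(y))  # ordinary bootstrap draw
    classes, counts = np.unique(y[idx], return_counts=True)
    n_min = counts.min()
    # Keep n_min randomly chosen examples of each class.
    keep = np.concatenate([
        rng.choice(idx[y[idx] == c], size=n_min, replace=False)
        for c in classes
    ])
    return keep

y = np.array([0] * 800 + [1] * 200)  # ~1:4 imbalance, as in the question
sample = balanced_bootstrap(y, rng)
print(np.unique(y[sample], return_counts=True))  # classes now equal-sized
```

Each tree in the balanced random forest is then trained on such a balanced sample, while the test data is never resampled.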
Source: https://stackoverflow.com/questions/54910960/random-forest-balancing-test-set