What is the correct procedure to split the dataset for a classification problem?

Submitted by 坚强是说给别人听的谎言 on 2019-11-29 18:07:26

TLDR: Try both!


I have been in similar situations before where my dataset was imbalanced, and I used train_test_split or KFold to get by.

At some point, though, I ran into the problem of handling imbalanced datasets directly and came across the techniques of oversampling and undersampling. For this, I would recommend the library imblearn.

You will find various techniques there for handling cases where one of your classes heavily outnumbers the other. I have personally used SMOTE a lot and have had relatively good success with it in such cases.
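As a minimal sketch of the oversampling idea (imblearn's RandomOverSampler implements this; the NumPy version below is only an illustration of the concept, not the library's actual code):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        idx = np.where(y == cls)[0]
        # draw extra indices (with replacement) for under-represented classes
        extra = rng.choice(idx, size=target - count, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# 90/10 imbalance becomes 90/90 after oversampling
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # [90 90]
```

Undersampling is the mirror image: drop majority-class rows down to the minority count instead of duplicating minority rows.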


Other references:

https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/

https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

You can use the stratify option of train_test_split, which splits each class in the same proportion between the train and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
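To see what stratify buys you, compare class counts before and after the split (toy data, purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(200, 1)
y = np.array([0] * 180 + [1] * 20)  # 90/10 imbalance

# stratify=y keeps the 90/10 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_test))  # 40 test rows: 36 of class 0, 4 of class 1
```

Without stratify=y, a random split could easily land only one or two minority examples in a small test set.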

I am working on a project where I experiment with a credit dataset for fraud detection (an imbalanced dataset: 1% minority class, 99% majority class) using different sampling methods, and I found that SMOTE gives better results on imbalanced datasets.

SMOTE (Synthetic Minority Oversampling Technique) is a powerful sampling method that goes beyond simple under- or over-sampling. The algorithm creates new instances of the minority class as convex combinations of neighbouring minority instances.

I have used SMOTE together with K-Fold cross-validation. Cross-validation helps ensure that the model learns the real patterns in the data and is not picking up too much noise.
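One detail worth being careful about when combining the two: resampling must happen inside each fold, on the training split only, otherwise synthetic points leak into the validation data. A sketch of that loop using scikit-learn's StratifiedKFold (the SMOTE call is shown as a comment; substitute imblearn's SMOTE there):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)  # imbalanced toy labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Resample ONLY the training fold, e.g.:
    # X_tr, y_tr = SMOTE(random_state=2).fit_resample(X_tr, y_tr)
    model = LogisticRegression().fit(X_tr, y_tr)
    # Validate on the untouched (non-resampled) fold
    scores.append(model.score(X[val_idx], y[val_idx]))
print(len(scores))  # one score per fold
```

StratifiedKFold also preserves the class ratio in every fold, which matters for the same reason stratify does in train_test_split.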

On an imbalanced dataset, the accuracy score can reach 99%, which seems impressive, but the minority class may be totally ignored. So I used the Matthews Correlation Coefficient (MCC) and the F1 score in addition to accuracy for performance measurement on an imbalanced dataset.
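A quick illustration of why accuracy misleads here: a degenerate model that always predicts the majority class scores 99% accuracy but 0 on both F1 and MCC (toy numbers, not from the credit dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = np.array([0] * 99 + [1])   # 1% minority class
y_pred = np.zeros(100, dtype=int)   # always predict the majority class

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
mcc = matthews_corrcoef(y_true, y_pred)
print(acc)  # 0.99 -- looks great
print(f1)   # 0.0  -- minority class never found
print(mcc)  # 0.0  -- no correlation with the true labels
```

F1 and MCC both collapse to zero because the minority class is never predicted, which is exactly the failure accuracy hides.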

Code:

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sm = SMOTE(random_state=2)

X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

References :

https://www.kaggle.com/qianchao/smote-with-imbalance-data
