imbalanced-data

SMOTE is giving array size / ValueError for all-categorical dataset

Submitted by 自闭症网瘾萝莉.ら on 2021-02-08 07:39:56
Question: I am using SMOTE-NC to oversample my categorical data. I have only 1 feature and 10,500 samples. When I run the code below, I get this error: ValueError Traceback (most recent call last) <ipython-input-151-a261c423a6d8> in <module>() 16 print(X_new.shape) # (10500, 1) 17 print(X_new) ---> 18 sm.fit_sample(X_new, Y_new) ~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py …
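A minimal sketch (assumed integer-coded data; the variable names X_new/Y_new follow the excerpt, not the asker's actual code) of why SMOTE-NC rejects an all-categorical matrix, and a common fallback for purely categorical data:

```python
import numpy as np
from imblearn.over_sampling import SMOTENC, RandomOverSampler

# Toy single-column dataset of integer-coded categories (assumed, for illustration).
X_new = np.array([[0], [1], [0], [2], [1], [0]])
Y_new = np.array([0, 0, 0, 0, 1, 1])

# SMOTE-NC interpolates the *continuous* columns and only copies the categorical
# ones, so it needs a mix of both. Declaring every column categorical (here:
# column 0 of a one-column matrix) makes it raise a ValueError.
sm = SMOTENC(categorical_features=[0], random_state=42)
try:
    sm.fit_resample(X_new, Y_new)   # fit_sample in the excerpt is the older name
except ValueError as exc:
    print("SMOTENC failed:", exc)

# For purely categorical data, plain random oversampling of the minority class
# is a safer choice (newer imblearn versions also provide SMOTEN for this case).
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_new, Y_new)
print(X_res.shape, np.bincount(y_res))
```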

All probability values are less than 0.5 on unseen data

Submitted by 女生的网名这么多〃 on 2021-01-28 23:38:14
Question: I have 15 features and a binary response variable, and I am interested in predicting probabilities rather than 0/1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weights, and balanced samples in the data frame, I achieved good accuracy and also a good Brier score. As you can see in the image, the predicted probability values for class 1 on the test data lie between 0 and 1. Here is the histogram of predicted probabilities on the test data: with …
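A minimal sketch, on assumed synthetic data, of the workflow the excerpt describes (a 500-tree random forest with balanced class weights, probabilities from predict_proba, Brier score); the closing comment notes why all probabilities can fall below 0.5 on genuinely unseen data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's 15-feature imbalanced dataset.
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
rf.fit(X_train, y_train)

# Probabilities for class 1; on test data drawn from the same distribution
# as the training data, these typically spread across [0, 1].
proba_test = rf.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, proba_test))

# On truly unseen data whose feature distribution has shifted, the same model
# can output only low probabilities - a sign of dataset shift rather than a
# coding bug.
```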

Process for oversampling data for imbalanced binary classification

Submitted by 扶醉桌前 on 2020-08-17 11:14:08
Question: I have roughly a 30%/70% split between class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance the classes into a 50-50 split. I was wondering whether oversampling should be done before or after splitting my data into train and test sets. In online examples I have generally seen it done before splitting, like this: df_class0 = train[train.predict_var == 0] df_class1 = train[train.predict_var == 1] df …
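A minimal sketch (the column name predict_var is taken from the excerpt; everything else is assumed) of the usual recommendation: split first, then oversample only the training set, so no duplicated minority rows leak into the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_minority(train: pd.DataFrame, target: str = "predict_var",
                        random_state: int = 42) -> pd.DataFrame:
    """Upsample the minority class of *train* (with replacement) to a 50-50 split."""
    counts = train[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    df_minority = train[train[target] == minority]
    df_majority = train[train[target] == majority]
    df_upsampled = df_minority.sample(n=len(df_majority), replace=True,
                                      random_state=random_state)
    # Shuffle so minority duplicates are not clustered at the end.
    return pd.concat([df_majority, df_upsampled]).sample(frac=1,
                                                         random_state=random_state)

# Usage: split the original (imbalanced) data first, oversample the training
# portion only, and evaluate on the untouched, still-imbalanced test set.
# train, test = train_test_split(df, test_size=0.2, stratify=df["predict_var"])
# train_balanced = oversample_minority(train)
```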

Correct way to do cross validation in a pipeline with imbalanced data

Submitted by 泪湿孤枕 on 2020-06-27 17:20:20
Question: For the given imbalanced data, I have created separate pipelines for standardization and one-hot encoding: numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())]) categorical_transformer = Pipeline(steps=[('ohe', OneHotCategoricalEncoder())]) After that, a column transformer combines the two pipelines: from sklearn.compose import ColumnTransformer preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features), ('cat', categorical …
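A minimal sketch (the feature lists, the classifier, and the use of sklearn's OneHotEncoder in place of feature_engine's OneHotCategoricalEncoder are assumptions) of wiring an oversampler into an imblearn Pipeline so that cross-validation resamples only the training folds:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

numeric_features = ["age", "income"]          # placeholder column names
categorical_features = ["gender", "region"]   # placeholder column names

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("scaler", StandardScaler())]), numeric_features),
    ("cat", Pipeline(steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_features),
])

# imblearn's Pipeline applies SMOTE during fit only, never during predict, so
# each CV split oversamples its own training fold and scores on an untouched
# validation fold - no resampling leakage into the evaluation data.
clf = ImbPipeline(steps=[
    ("preprocess", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("model", RandomForestClassifier(random_state=42)),
])

# scores = cross_val_score(clf, X, y, scoring="f1",
#                          cv=StratifiedKFold(n_splits=5, shuffle=True,
#                                             random_state=42))
```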

Multilabel classification with class imbalance in Pytorch

Submitted by 风流意气都作罢 on 2020-06-27 09:59:06
Question: I have a multilabel classification problem that I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130. The problem is that my dataset is very imbalanced. For some classes I have only ~900 examples, which is around 1%; for "overrepresented" classes I have ~12,000 examples (15%). When I train the model, I use BCEWithLogitsLoss from PyTorch with a …
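A minimal sketch (made-up data and scaled-down sizes, not the asker's CNN) of the common way to counter per-class imbalance with BCEWithLogitsLoss: a pos_weight vector set to negatives/positives for each class:

```python
import torch
import torch.nn as nn

# Scaled down from the 80,000 examples x 7,900 classes in the question.
n_examples, n_classes = 8_000, 790

# Multi-hot target matrix: one row per example, 1 where a label applies (random here).
targets = (torch.rand(n_examples, n_classes) < 0.016).float()

pos_counts = targets.sum(dim=0).clamp(min=1)   # positives per class (>= 1)
neg_counts = n_examples - pos_counts
pos_weight = neg_counts / pos_counts           # shape: (n_classes,)

# pos_weight rescales the positive term of the loss per class, so rare labels
# contribute more gradient than frequent ones.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(32, n_classes)            # stand-in for one CNN batch output
loss = criterion(logits, targets[:32])
print(loss.item())
```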