Oversampling

R data.table - sample by group with different sampling proportion

Submitted by 那年仲夏 on 2021-02-05 11:14:03
Question: I would like to efficiently take a random sample by group from a data.table, but it should be possible to sample a different proportion for each group. If I wanted to sample the fraction sampling_fraction from each group, I could take inspiration from this question and its related answer to do something like:

```r
DT = data.table(a = sample(1:2), b = sample(1:1000, 20))

group_sampler <- function(data, group_col, sample_fraction) {
  # this function samples sample_fraction <0,1> from each group in the data.table
```
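Although the question is about R's data.table, the group-wise idea can be sketched in pandas for illustration (the `fractions` mapping, column names, and toy data below are made up, not from the question):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Toy table: group column "a" with two groups of 10 rows each.
DT = pd.DataFrame({"a": np.repeat([1, 2], 10), "b": np.arange(20)})

# Hypothetical per-group sampling proportions, keyed by group value.
fractions = {1: 0.2, 2: 0.5}

# Sample each group with its own fraction; group_keys=False keeps a flat index.
sampled = (
    DT.groupby("a", group_keys=False)
      .apply(lambda g: g.sample(frac=fractions[g.name], random_state=42))
)
print(sampled)
```

With 10 rows per group, fractions of 0.2 and 0.5 keep 2 and 5 rows respectively.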

Function for cross validation and oversampling (SMOTE)

Submitted by 佐手、 on 2020-01-14 05:34:07
Question: I wrote the code below. X is a dataframe with shape (1000, 5) and y is a dataframe with shape (1000, 1). y holds the target to predict, and it is imbalanced. I want to apply cross-validation and SMOTE.

```python
def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits=n)
    acc_scores = []
    rec_scores = []
    f1_scores = []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample
```

Over-Sampling Class Imbalance Train/Test Split “Found input variables with inconsistent numbers of samples” Solution?

Submitted by 旧城冷巷雨未停 on 2020-01-06 02:25:46
Question: I am trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1. https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook I am confused about the pipeline and coding structure. Should you over-sample after the train/test split? If so, how do you deal with the fact that the target label is dropped from X? I tried keeping it, then performed the over-sampling, then dropped the labels on X_train/X_test and replaced the new
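A minimal sketch of the usual ordering: split first, then oversample only the training portion. The toy data (roughly 8:1, mirroring the question's ratio) and the simple random upsampling via `sklearn.utils.resample` are illustrative stand-ins for the article's approach:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.random(900), "target": [0] * 800 + [1] * 100})

# 1) Split first, keeping the label out of X.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature"]], df["target"], test_size=0.2, stratify=df["target"],
    random_state=0)

# 2) Recombine the training portion only, and upsample its minority class.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["target"] == 0]
minority = train[train["target"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=0)
train_bal = pd.concat([majority, minority_up])

# 3) Re-split into features and label; the test set is never oversampled.
X_train_bal = train_bal.drop(columns="target")
y_train_bal = train_bal["target"]
print(y_train_bal.value_counts())
```

Reattaching the label to X_train temporarily (step 2) is one way to keep rows and labels aligned during resampling; the test set keeps its original 8:1 ratio.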

SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

Submitted by 大兔子大兔子 on 2019-12-30 11:28:05
Question: I have already pre-cleaned the data; the first rows look like this:

```
[IN] df.head()
[OUT]
   Year                                            cleaned
0  1909  acquaint hous receiv follow letter clerk crown...
1  1909  ask secretari state war whether issu statement...
2  1909  i beg present petit sign upward motor car driv...
3  1909  i desir ask secretari state war second lieuten...
4  1909  ask secretari state war whether would introduc...
```

I have called train_test_split() as follows:

```
[IN] X_train, X_test, y_train, y_test = train_test
```

Using Smote with Gridsearchcv in Scikit-learn

Submitted by 百般思念 on 2019-12-29 01:35:07
Question: I'm dealing with an imbalanced dataset and want to do a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data, I want to use SMOTE, and I know I can include that as a stage of a pipeline and pass it to GridSearchCV. My concern is that SMOTE would then be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset

Oversampling or SMOTE in Pyspark

Submitted by 瘦欲@ on 2019-12-12 17:14:51
Question: I have 7 classes and 115 records in total, and I wanted to run a Random Forest model on this data. But the data is not enough to get high accuracy, so I wanted to apply oversampling over all the classes in such a way that the majority class itself gets a higher count, and then the minority classes accordingly. Is this possible in PySpark?

```
+---------+-----+
| SubTribe|count|
+---------+-----+
|    Chill|   10|
|     Cool|   18|
|Adventure|   18|
|    Quirk|   13|
|  Mystery|   25|
|    Party|   18|
|Glamorous|   13|
+---------+-----+
```
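One common PySpark approach is to sample each class with replacement at a fraction that scales it up to (or beyond) the majority count, then union the results. A plain-Python sketch of just the fraction computation, using the counts from the question (in PySpark these fractions would feed per-class calls like `df.filter(col("SubTribe") == cls).sample(withReplacement=True, fraction=f)`, followed by a union; that part is an assumption about the reader's Spark setup and is not shown):

```python
# Per-class counts copied from the question's table.
counts = {"Chill": 10, "Cool": 18, "Adventure": 18, "Quirk": 13,
          "Mystery": 25, "Party": 18, "Glamorous": 13}

target = max(counts.values())  # scale every class up to the largest one
fractions = {cls: target / n for cls, n in counts.items()}
print(fractions)
```

Fractions above 1.0 are why `withReplacement=True` is needed: each row is kept, on average, `fraction` times. Note the result is only balanced in expectation, since Spark's sampling is probabilistic.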

Duplicating training examples to handle class imbalance in a pandas data frame

Submitted by て烟熏妆下的殇ゞ on 2019-11-30 19:45:55
I have a DataFrame in pandas that contains training examples, for example:

```
   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0
```

which I generated using:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of
```
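A sketch of one way to balance by duplication: upsample every class (with replacement) to the size of the largest one, reusing the question's toy frame. The `target` name and the groupby-then-concat pattern are this sketch's choices, not the asker's:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
})

# Duplicate rows (sampling with replacement) until every class matches the
# size of the largest class.
target = frame["class"].value_counts().max()
balanced = pd.concat(
    [g.sample(n=target, replace=True, random_state=0)
     for _, g in frame.groupby("class")],
    ignore_index=True)

print(balanced["class"].value_counts())
```

Because the duplicated rows are drawn with replacement, minority examples appear multiple times; shuffling `balanced` before training may be advisable depending on the learner.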
