How to preserve IDs after data balancing techniques like ROSE or SMOTE

Question

df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0) 
> df
  id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
  10  0.0  0.0  0.1 0.00 0.00  0.4     0
   2  0.2  0.0  0.0 0.22 0.00  0.0     0
   4  0.5  0.0  0.0 0.35 0.00  0.0     0
   5  0.8  0.0  0.0 0.80 0.00  0.0     0
   6  0.0  0.9  0.0 0.00 0.70  0.0     0
   7  0.0  0.3  0.0 0.00 0.30  0.0     1
   8  0.0  0.0  0.2 0.00 0.00  0.4     1
   9  0.0  0.0  0.8 0.00 0.00  0.9     0
  A1  0.0  0.0  0.1 0.00 0.00  0.2     0
  B3  0.0  0.3  0.0 0.00 0.23  0.0     0

I am using the ROSE function to generate samples for imbalanced data, and I want to preserve the ID of each observation in df afterwards. This is the output I get from ROSE:

library(ROSE)
df.rose <- ROSE(class ~ ., data = df, seed = 123, N = 20, p = 0.25)$data

> df.rose
 id        s1c1         s1c2          s1c3        s2c1         s2c2        s2c3   class
 B3 -0.24636399  0.513435064 -0.0844105623  0.04695640  0.419960189  0.08112992     0
  9 -0.05029030  0.199689698  0.7022285344  0.08255245 -0.133951228  1.16820765     0
  9 -0.23671562  0.167377715  0.9634146745 -0.10923003 -0.129948534  1.00641398     0
 B3 -0.16816685  0.434632663 -0.0174671002 -0.07245581  0.423706144 -0.07969934     0
  9 -0.14420654 -0.015047974  0.8530741203 -0.22148879 -0.053786877  1.18091542     0
  9 -0.38914709 -0.074365870  0.7940190162 -0.23306056 -0.230564666  1.14293933     0
  6  0.19329086  0.807524478 -0.0089820194  0.06600218  0.734243934  0.13409831     0
  6  0.03538563  0.731147735  0.2867432037  0.09746303  0.673766711  0.05837655     0
  4  0.23741363 -0.050535412 -0.0473024899  0.36152575  0.001088718 -0.15354050     0
  2  0.48927513 -0.307561385  0.3177238885  0.42054668  0.072770343  0.33271737     0
 B3  0.09839211  0.827176406 -0.3244875053  0.44579006  0.159991098 -0.14678016     0
 B3 -0.06807770  0.593601657  0.1224855617 -0.10677452  0.351707470  0.53486376     0
  9  0.20651979 -0.272977578  0.8259493668 -0.50212781 -0.041644690  1.27476593     0
  8  0.00000000 -0.008315345  0.0008152742  0.00000000  0.043469230  0.29596908     1
  7  0.00000000  0.155050387 -0.0068404803  0.00000000  0.314397160 -0.50556877     1
  7  0.00000000 -0.008021610  0.0639465277  0.00000000  0.122372337  0.27856790     1
  8  0.00000000 -0.070217063  0.2370763279  0.00000000 -0.013168583  0.04034823     1
  7  0.00000000  0.469712631  0.0130102656  0.00000000  0.566767608  0.18219645     1
  7  0.00000000  0.193749720 -0.0788801623  0.00000000  0.383380004  0.47007644     1
  7  0.00000000  0.412273782 -0.1046108759  0.00000000  0.307614552 -0.35552820     1

Not all of the IDs appear in the ROSE output, and some appear several times. I want to keep every ID. Does anyone know another method for handling imbalanced data that preserves the ID of each observation? I don't want the IDs to get mixed up. I have tried oversampling, undersampling, and SMOTE without good results. I have also tried converting the id column to a factor, but that didn't work.

Answer 1:

If anyone is still wondering, I ended up using this method. I wanted only the new synthetic observations, but SMOTE kept reducing the size of my dataset. Hope it helps:

library(DMwR)
library(dplyr)

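# df: the data frame to over/undersample; var: the imbalanced binary factor

# Give every original row a unique ID so exact duplicates can be detected later
df$ID <- seq.int(nrow(df))
# SMOTE returns synthetic minority rows, the original minority rows, and a sample of majority rows
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
# Keep only the rows of the minority level ("yes"), synthetic and original
sub_df <- subset(df_smote, var == "yes")
# Append them to the full original data
final_df <- rbind(df, sub_df)
# Drop the original minority rows that now appear twice
final_df <- distinct(final_df)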

1. Create an ID column so that duplicate detection only matches rows that are exactly the same row, not different observations that happen to share the same feature values.

2. Run SMOTE with the desired parameters (var is the binary variable on which you have the imbalance).

3. Subset the SMOTE output to the rows whose var has the level you want to oversample, in this case "yes".

4. Row-bind that subset to the original dataset.

5. Remove the duplicates, that is, the original "yes" rows that SMOTE returned unchanged.

You end up with the complete original dataset plus only the synthetic observations of the desired level.
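
The same "keep the originals, append only synthetic rows" idea can be sketched in base R without any package. This is only an illustration under stated assumptions: make_synthetic() is a hypothetical stand-in that interpolates between randomly chosen minority rows (it is not the SMOTE or ROSE algorithm), and the origin column and the "syn_" ID prefix are names invented here to mark which rows are synthetic. The toy data frame is made up and loosely mirrors the one in the question.

```r
# Toy data (made up for illustration): 'id' identifies original rows,
# 'class' is the imbalanced label with "1" as the minority level.
df <- data.frame(
  id    = c("A1", "2", "B3", "4", "5", "6", "7", "8"),
  x1    = c(0.0, 0.2, 0.0, 0.5, 0.8, 0.0, 0.0, 0.1),
  x2    = c(0.1, 0.0, 0.3, 0.0, 0.0, 0.9, 0.3, 0.2),
  class = factor(c(0, 0, 0, 0, 0, 0, 1, 1)),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for SMOTE/ROSE: each synthetic row is a random
# point on the segment between two randomly chosen minority rows.
make_synthetic <- function(features, n) {
  m <- as.matrix(features)
  a <- m[sample(nrow(m), n, replace = TRUE), , drop = FALSE]
  b <- m[sample(nrow(m), n, replace = TRUE), , drop = FALSE]
  out <- a + runif(n) * (b - a)  # one random weight per synthetic row
  rownames(out) <- NULL
  as.data.frame(out)
}

set.seed(123)
feature_cols <- c("x1", "x2")
minority <- df[df$class == "1", feature_cols]

# Number of synthetic rows needed to match the majority class size
n_new <- sum(df$class == "0") - sum(df$class == "1")

synthetic <- make_synthetic(minority, n_new)
synthetic$id     <- paste0("syn_", seq_len(n_new))  # IDs that cannot clash with originals
synthetic$class  <- factor("1", levels = levels(df$class))
synthetic$origin <- "synthetic"

# Original rows are never passed through the generator, so their IDs
# and feature values stay exactly as they were.
df$origin <- "original"
balanced <- rbind(df, synthetic[names(df)])
```

Because the ID column is left out of the features handed to the generator, it is never treated as a predictor, and filtering on origin == "original" recovers the input rows unchanged.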

Source: https://stackoverflow.com/questions/45550848/how-to-preserve-ids-after-data-balancing-technique-like-rose-smote

Tags

machine-learning

classification

resampling