How to preserve IDs after data balancing techniques like ROSE or SMOTE

拜拜、爱过 submitted on 2020-07-09 15:00:47

Question


df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0) 
> df
  id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
  10  0.0  0.0  0.1 0.00 0.00  0.4     0
   2  0.2  0.0  0.0 0.22 0.00  0.0     0
   4  0.5  0.0  0.0 0.35 0.00  0.0     0
   5  0.8  0.0  0.0 0.80 0.00  0.0     0
   6  0.0  0.9  0.0 0.00 0.70  0.0     0
   7  0.0  0.3  0.0 0.00 0.30  0.0     1
   8  0.0  0.0  0.2 0.00 0.00  0.4     1
   9  0.0  0.0  0.8 0.00 0.00  0.9     0
  A1  0.0  0.0  0.1 0.00 0.00  0.2     0
  B3  0.0  0.3  0.0 0.00 0.23  0.0     0

I am using the ROSE function to generate synthetic samples for imbalanced data, but I want to preserve the ID of each observation in df after running ROSE. This is the output I get:

 df.rose <- ROSE(class ~ ., data=df, seed=123,N=20,p=0.25)$data

> df.rose
 id        s1c1         s1c2          s1c3        s2c1         s2c2        s2c3   class
 B3 -0.24636399  0.513435064 -0.0844105623  0.04695640  0.419960189  0.08112992     0
  9 -0.05029030  0.199689698  0.7022285344  0.08255245 -0.133951228  1.16820765     0
  9 -0.23671562  0.167377715  0.9634146745 -0.10923003 -0.129948534  1.00641398     0
 B3 -0.16816685  0.434632663 -0.0174671002 -0.07245581  0.423706144 -0.07969934     0
  9 -0.14420654 -0.015047974  0.8530741203 -0.22148879 -0.053786877  1.18091542     0
  9 -0.38914709 -0.074365870  0.7940190162 -0.23306056 -0.230564666  1.14293933     0
  6  0.19329086  0.807524478 -0.0089820194  0.06600218  0.734243934  0.13409831     0
  6  0.03538563  0.731147735  0.2867432037  0.09746303  0.673766711  0.05837655     0
  4  0.23741363 -0.050535412 -0.0473024899  0.36152575  0.001088718 -0.15354050     0
  2  0.48927513 -0.307561385  0.3177238885  0.42054668  0.072770343  0.33271737     0
 B3  0.09839211  0.827176406 -0.3244875053  0.44579006  0.159991098 -0.14678016     0
 B3 -0.06807770  0.593601657  0.1224855617 -0.10677452  0.351707470  0.53486376     0
  9  0.20651979 -0.272977578  0.8259493668 -0.50212781 -0.041644690  1.27476593     0
  8  0.00000000 -0.008315345  0.0008152742  0.00000000  0.043469230  0.29596908     1
  7  0.00000000  0.155050387 -0.0068404803  0.00000000  0.314397160 -0.50556877     1
  7  0.00000000 -0.008021610  0.0639465277  0.00000000  0.122372337  0.27856790     1
  8  0.00000000 -0.070217063  0.2370763279  0.00000000 -0.013168583  0.04034823     1
  7  0.00000000  0.469712631  0.0130102656  0.00000000  0.566767608  0.18219645     1
  7  0.00000000  0.193749720 -0.0788801623  0.00000000  0.383380004  0.47007644     1
  7  0.00000000  0.412273782 -0.1046108759  0.00000000  0.307614552 -0.35552820     1

Not all of the IDs appear after ROSE, and I want to keep all of them. Does anyone know another method for handling imbalanced data that preserves the ID of each observation? I don't want the IDs to get mixed up. I have tried oversampling, undersampling, and SMOTE, but none gave good results. I also tried converting the id column to a factor, but that didn't work.


Answer 1:


If anyone is still wondering, I ended up using this method. I wanted only the new synthetic observations, but SMOTE kept reducing the size of my dataset. Hope it helps:

library(DMwR)
library(dplyr)

# df - dataframe you want to use over/undersampling on

df$ID <- seq.int(nrow(df))
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
sub_df <- subset(df_smote, var == "yes")
final_df <- rbind(df, sub_df)
final_df <- distinct(final_df)
  1. Create an ID column so that only rows that are exactly identical (not merely observations with the same feature values) count as duplicates.
  2. Run SMOTE with the desired parameters (var is the binary variable that is imbalanced).
  3. Subset the SMOTE output to the rows with the desired level of var, in this case "yes".
  4. Row-bind that subset to the original dataset.
  5. Remove the duplicates introduced by SMOTE.
  6. The result is the original dataset plus only the synthetic observations of the over/undersampled level.
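
The same pattern (leave the original rows and their IDs untouched, append only new minority-class rows) can be sketched in base R without DMwR. This is only an illustration under stated assumptions: make_synthetic() is a hypothetical helper, not part of ROSE or DMwR, and it uses plain interpolation between random pairs of minority rows as a simplified stand-in for SMOTE (no nearest-neighbour search). The toy data, the "syn_" ID prefix, and the synthetic flag column are likewise made up for the example.

```r
# Toy data loosely mirroring the question: an id column, numeric features,
# and an imbalanced binary class (2 minority rows out of 10).
df <- data.frame(
  id    = c("A1", "2", "B3", "4", "5", "6", "7", "8", "9", "10"),
  s1c1  = c(0, 0.2, 0, 0.5, 0.8, 0, 0, 0, 0, 0),
  s1c2  = c(0, 0, 0.3, 0, 0, 0.9, 0.3, 0, 0, 0),
  class = factor(c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0)),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not a ROSE/DMwR function): builds n
# synthetic rows by interpolating between random pairs of minority rows.
make_synthetic <- function(minority, feature_cols, n) {
  a <- minority[sample(nrow(minority), n, replace = TRUE), , drop = FALSE]
  b <- minority[sample(nrow(minority), n, replace = TRUE), , drop = FALSE]
  w <- runif(n)                      # one interpolation weight per new row
  fa <- as.matrix(a[feature_cols])
  fb <- as.matrix(b[feature_cols])
  out <- a                           # copies the class label of the first parent
  out[feature_cols] <- as.data.frame(fa + w * (fb - fa))
  out
}

set.seed(123)
feature_cols <- c("s1c1", "s1c2")
minority <- df[df$class == "1", ]
synth <- make_synthetic(minority, feature_cols, n = 6)

# Give synthetic rows fresh IDs so they cannot be confused with originals,
# and flag them explicitly.
synth$id <- paste0("syn_", seq_len(nrow(synth)))
df$synthetic <- FALSE
synth$synthetic <- TRUE

# Originals first (unchanged), synthetic minority rows appended.
final_df <- rbind(df, synth)
rownames(final_df) <- NULL
```

Because the original rows are never passed through the sampler, every original id survives, and final_df[!final_df$synthetic, ] recovers the original data exactly. Whether synthetic rows should carry an ID at all depends on the downstream use; here they are simply given a distinct prefix.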


Source: https://stackoverflow.com/questions/45550848/how-to-preserve-ids-after-data-balancing-technique-like-rose-smote
