Question
df1 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s1c1=c(0,0.2,0,0.5,0.8,0,0,0,0,0),s1c2=c(0,0,0.3,0,0,0.9,0.3,0,0,0),s1c3=c(0.1,0,0,0,0,0,0,0.2,0.8,0.1))
df2 = data.frame(id=c('A1','2','B3','4','5','6','7','8','9','10'),s2c1=c(0,0.22,0,0.35,0.8,0,0,0,0,0),s2c2=c(0,0,0.23,0,0,0.7,0.3,0,0,0),s2c3=c(0.2,0,0,0,0,0,0,0.4,0.9,0.4))
df <- merge(df1,df2, by="id",all=TRUE)
df$class <- c(0,0,0,0,0,1,1,0,0,0)
> df
id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
10 0.0 0.0 0.1 0.00 0.00 0.4 0
2 0.2 0.0 0.0 0.22 0.00 0.0 0
4 0.5 0.0 0.0 0.35 0.00 0.0 0
5 0.8 0.0 0.0 0.80 0.00 0.0 0
6 0.0 0.9 0.0 0.00 0.70 0.0 0
7 0.0 0.3 0.0 0.00 0.30 0.0 1
8 0.0 0.0 0.2 0.00 0.00 0.4 1
9 0.0 0.0 0.8 0.00 0.00 0.9 0
A1 0.0 0.0 0.1 0.00 0.00 0.2 0
B3 0.0 0.3 0.0 0.00 0.23 0.0 0
I am using the ROSE function to generate samples for imbalanced data, but I want to preserve the id of each observation in df after ROSE. This is the output I get after using ROSE:
library(ROSE)
df.rose <- ROSE(class ~ ., data=df, seed=123,N=20,p=0.25)$data
> df.rose
id s1c1 s1c2 s1c3 s2c1 s2c2 s2c3 class
B3 -0.24636399 0.513435064 -0.0844105623 0.04695640 0.419960189 0.08112992 0
9 -0.05029030 0.199689698 0.7022285344 0.08255245 -0.133951228 1.16820765 0
9 -0.23671562 0.167377715 0.9634146745 -0.10923003 -0.129948534 1.00641398 0
B3 -0.16816685 0.434632663 -0.0174671002 -0.07245581 0.423706144 -0.07969934 0
9 -0.14420654 -0.015047974 0.8530741203 -0.22148879 -0.053786877 1.18091542 0
9 -0.38914709 -0.074365870 0.7940190162 -0.23306056 -0.230564666 1.14293933 0
6 0.19329086 0.807524478 -0.0089820194 0.06600218 0.734243934 0.13409831 0
6 0.03538563 0.731147735 0.2867432037 0.09746303 0.673766711 0.05837655 0
4 0.23741363 -0.050535412 -0.0473024899 0.36152575 0.001088718 -0.15354050 0
2 0.48927513 -0.307561385 0.3177238885 0.42054668 0.072770343 0.33271737 0
B3 0.09839211 0.827176406 -0.3244875053 0.44579006 0.159991098 -0.14678016 0
B3 -0.06807770 0.593601657 0.1224855617 -0.10677452 0.351707470 0.53486376 0
9 0.20651979 -0.272977578 0.8259493668 -0.50212781 -0.041644690 1.27476593 0
8 0.00000000 -0.008315345 0.0008152742 0.00000000 0.043469230 0.29596908 1
7 0.00000000 0.155050387 -0.0068404803 0.00000000 0.314397160 -0.50556877 1
7 0.00000000 -0.008021610 0.0639465277 0.00000000 0.122372337 0.27856790 1
8 0.00000000 -0.070217063 0.2370763279 0.00000000 -0.013168583 0.04034823 1
7 0.00000000 0.469712631 0.0130102656 0.00000000 0.566767608 0.18219645 1
7 0.00000000 0.193749720 -0.0788801623 0.00000000 0.383380004 0.47007644 1
7 0.00000000 0.412273782 -0.1046108759 0.00000000 0.307614552 -0.35552820 1
I am not getting all of the ids back after ROSE, and I want to keep all of them. Does anyone know another method for handling imbalanced data that preserves the id of each observation? I don't want the ids to get mixed up. I have tried oversampling, undersampling, and SMOTE, but without good results. I have also tried converting the id column to a factor, but that didn't work.
Answer 1:
If anyone is still wondering, I ended up using this method. I wanted to keep my original data and add only the new synthetic observations, but SMOTE kept reducing the size of my dataset. Hope it helps:
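library(DMwR)
library(dplyr)
# df - data frame you want to over/undersample; var is the binary target (a factor)
# Add a unique ID so each original row can be told apart
df$ID <- seq.int(nrow(df))
# perc.over = 100 creates one synthetic case per minority case; the default
# perc.under = 200 undersamples the majority class, so df_smote can be smaller than df
df_smote <- DMwR::SMOTE(var ~ ., df, perc.over = 100, k = 5)
# Keep only the minority-level rows (original and synthetic)
sub_df <- subset(df_smote, var == "yes")
# Append them to the full original data, then drop exact duplicates
final_df <- rbind(df, sub_df)
final_df <- distinct(final_df)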
- Create an ID column so that removing duplicates later only drops rows that are truly the same observation, not different observations that happen to share the same feature values.
- Run SMOTE with the desired parameters (var is the binary variable that is imbalanced).
- Subset the SMOTE output to the rows where var has the level you want to oversample, in this case "yes".
- Row-bind that subset to the original dataset.
- Remove the duplicates that the row-bind introduces (the original "yes" rows appear in both pieces).
- You end up with the original dataset plus only the synthetic observations of the desired level.
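The steps above can be sketched in base R without any extra packages. This is only a minimal sketch under stated assumptions: the helper make_synthetic is hypothetical and is a simplified stand-in (sampling minority rows with replacement and adding Gaussian noise), not SMOTE or ROSE. The source_id and synthetic columns and the "syn_" id prefix are also made up for illustration. The point is the bookkeeping: original ids stay untouched, and every synthetic row gets its own id plus a pointer to the row it was generated from.

```r
# Toy data shaped like the question's df: an id, numeric features, a binary class.
df <- data.frame(
  id    = c("A1", "2", "B3", "4", "5", "6", "7", "8", "9", "10"),
  s1c1  = c(0, 0.2, 0, 0.5, 0.8, 0, 0, 0, 0, 0),
  s1c2  = c(0, 0, 0.3, 0, 0, 0.9, 0.3, 0, 0, 0),
  class = factor(c(0, 0, 0, 0, 0, 1, 1, 0, 0, 0)),
  stringsAsFactors = FALSE
)

set.seed(123)

# Hypothetical helper (not part of ROSE or DMwR): draw minority rows with
# replacement and add small Gaussian noise to the numeric feature columns.
make_synthetic <- function(data, id_col, class_col, minority, n_new, sd = 0.05) {
  pool  <- data[data[[class_col]] == minority, , drop = FALSE]
  seeds <- pool[sample(nrow(pool), n_new, replace = TRUE), , drop = FALSE]
  num_cols <- setdiff(names(data)[sapply(data, is.numeric)], c(id_col, class_col))
  for (col in num_cols) {
    seeds[[col]] <- seeds[[col]] + rnorm(n_new, mean = 0, sd = sd)
  }
  seeds$source_id <- seeds[[id_col]]                 # original row that seeded it
  seeds[[id_col]] <- paste0("syn_", seq_len(n_new))  # new id, assumed not to collide
  seeds$synthetic <- TRUE
  rownames(seeds) <- NULL
  seeds
}

# Original rows keep their ids; they are their own source.
original <- df
original$source_id <- original$id
original$synthetic <- FALSE

synthetic <- make_synthetic(df, "id", "class", minority = "1", n_new = 4)
balanced  <- rbind(original, synthetic)
```

Because the id column is never passed through the generator, no original id can be lost or altered, and balanced[!balanced$synthetic, ] recovers the original rows. The same bookkeeping should carry over if the noise step is replaced by a real SMOTE or ROSE call, as long as the id column is left out of the model formula.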
Source: https://stackoverflow.com/questions/45550848/how-to-preserve-ids-after-data-balancing-technique-like-rose-smote