Take a random sample within each group

说谎 2020-11-28 13:23

I have a data frame df with almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). df looks like:

        ID  Year    Temp    ph
1           


        
8 answers
  • 2020-11-28 13:47

    This is not a very elegant solution, but it may work.

    library(data.table)
    df <- data.table(df)
    f <- list()
    for (i in unique(df$ID)) {
      # Draw 500 random rows from the current ID
      f[[as.character(i)]] <- df[ID == i][sample(.N, 500)]
    }
    # Stack the per-ID samples into one table
    dfnew <- rbindlist(f)
    
  • 2020-11-28 13:54
    library(data.table) #1
    df <- data.table(df) #2
    df[,group_num := sample(2,.N,replace = TRUE,prob = c(500,.N-500)/.N),by = "ID"] #3
    df_sample = df[group_num == 1,] #4
    

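    Line #3 keeps each row with probability 500/.N, so each ID gets roughly, not exactly, 500 rows. For exactly 500 rows per ID, change lines #3 and #4 to: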

    df[,random_num := sample(.N,.N),by="ID"]
    df_sample  = df[random_num <=500,]
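
    A small sketch that checks the per-ID counts of both variants on made-up data. The table `toy`, its columns, and the sample size of 5 are assumptions for illustration only:

    ```r
    library(data.table)

    set.seed(1)
    # Hypothetical toy data: 3 IDs with 100 rows each
    toy <- data.table(ID = rep(c("a", "b", "c"), each = 100), Temp = rnorm(300))

    # Variant 1: each row is kept independently with probability 5 / .N,
    # so the per-ID count is random (5 only on average)
    toy[, group_num := sample(2, .N, replace = TRUE, prob = c(5, .N - 5) / .N), by = "ID"]
    approx_counts <- toy[group_num == 1, .N, by = "ID"]

    # Variant 2: rank the rows of each ID by a random permutation and
    # keep ranks 1 to 5, so the per-ID count is exactly 5
    toy[, random_num := sample(.N, .N), by = "ID"]
    exact_counts <- toy[random_num <= 5, .N, by = "ID"]

    print(approx_counts)
    print(exact_counts)
    ```

    The second variant also copes with an ID that has fewer rows than the requested size: all of its rows have a rank below the threshold and are kept.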
    
  • 2020-11-28 13:57

    In case you have big datasets, a data.table solution could go like this:

    library(data.table)
    
    # Generate 26 million rows of random data
    set.seed(2019)
    dt <- data.table(c1 = sample(length(LETTERS)*10^6), 
                     c2 = sample(LETTERS, replace = TRUE))
    
    # For each letter, sample 500 rows
    dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]
    
    # We indeed sampled 500 rows for each letter
    dt_sample[, .N, by = c2][order(c2)]
    #>     c2   N
    #>  1:  A 500
    #>  2:  D 500
    #>  3:  G 500
    #>  4:  I 500
    #>  5:  M 500
    #>  6:  N 500
    #>  7:  O 500
    #>  8:  P 500
    #>  9:  Q 500
    #> 10:  R 500
    #> 11:  S 500
    #> 12:  T 500
    #> 13:  U 500
    #> 14:  V 500
    #> 15:  W 500
    #> 16:  Y 500
    #> 17:  Z 500
    

    Created on 2019-04-23 by the reprex package (v0.2.1)

    If your data is unbalanced, in the sense that some groups have fewer rows than the desired sample size, cap the sample size defensively at min(500, .N); see "Sample random rows within each group in a data.table". For example:

    dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
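
    To see the cap at work, here is a minimal sketch on a deliberately unbalanced table. The table `dt_small` and the sample size of 5 are made up for illustration:

    ```r
    library(data.table)

    set.seed(42)
    # Hypothetical unbalanced data: group "a" has 3 rows, group "b" has 20
    dt_small <- data.table(c1 = 1:23, c2 = rep(c("a", "b"), times = c(3, 20)))

    # Without the cap, sample() would fail for group "a" (cannot take 5 of 3 rows);
    # min(5, .N) falls back to the whole group instead
    capped <- dt_small[, .SD[sample(x = .N, size = min(5, .N))], by = c2]

    # Rows per group after sampling
    print(capped[, .N, by = c2])
    ```

    Group "a" comes back complete (3 rows) and group "b" is cut down to 5.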

  • 2020-11-28 13:59

    Try this:

    library(plyr)
    ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
    
  • 2020-11-28 14:07

    An approach for when one of the IDs has fewer rows than the sample size. Here I used the mtcars dataset:

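    n <- 8  # sample size per ID
    df <- mtcars
    df$ID <- df$cyl
    
    # Return all row indices of a group that has at most n rows,
    # otherwise n randomly chosen ones (kept in their original order)
    FUN <- function(x, n) {
        if (length(x) <= n) return(x)
        x[x %in% sample(x, n)]
    }
    
    # Split the row indices by ID, sample within each group, then subset
    df[unlist(lapply(split(seq_len(nrow(df)), df$ID), FUN, n = n)), ]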
    
  • 2020-11-28 14:08

    This is available as the sample_n function in dplyr:

    library(dplyr)
    new_df <- df %>% group_by(ID) %>% sample_n(500)
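
    In dplyr 1.0.0 and later, sample_n is superseded by slice_sample. A minimal sketch of the same idea, assuming mtcars as a stand-in for df and cyl as a stand-in for ID, with a sample size of 5 chosen only to fit that small dataset:

    ```r
    library(dplyr)

    set.seed(123)
    # mtcars stands in for df, with cyl playing the role of ID
    new_df <- mtcars %>%
      group_by(cyl) %>%
      slice_sample(n = 5) %>%
      ungroup()

    # Rows per group after sampling
    print(count(new_df, cyl))
    ```

    As far as I know, slice_sample returns the whole group when it has fewer than n rows, whereas sample_n stops with an error in that case, so it is worth checking the per-group counts afterwards.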
    