Selecting random rows by category from a data frame?

一个人的身影 2021-01-22 17:06

I have a data frame as follows:

Category Name Value

How would I select say, 5 random names per category? Using sample returns random ro

3 Answers
  • If you want the same number of items from each category, this is easy:

    df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]
    

    For example, I generated df as follows:

    df <- data.frame(Category=rep(1:5,each=20),Name=1:100,Value=rnorm(100))
    

    Then I get the following from my code:

    > df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]
        Category Name       Value
    5          1    5  0.25151044
    20         1   20  1.52486482
    18         1   18  0.69313462
    30         2   30  0.73444185
    27         2   27  0.24000427
    39         2   39 -0.10108203
    46         3   46 -0.37200574
    49         3   49 -1.84920469
    43         3   43  0.35976388
    68         4   68  0.57879516
    76         4   76 -0.11049302
    64         4   64 -0.13471303
    100        5  100  0.95979408
    95         5   95 -0.01928741
    99         5   99  0.85725242
    

    If you want different numbers of rows from each category it will be more complicated.
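
    A minimal sketch of one way to handle that case in base R, assuming a hypothetical named vector `n_per_cat` (not part of the original answer) that gives the number of rows wanted for each category. It also assumes every category has at least as many rows as requested:

    ```r
    set.seed(42)  # only so the illustration is reproducible
    df <- data.frame(Category = rep(1:5, each = 20), Name = 1:100, Value = rnorm(100))

    # Hypothetical per-category sample sizes, named by category level
    n_per_cat <- c("1" = 2, "2" = 3, "3" = 4, "4" = 5, "5" = 2)

    # Split the row indices by category, then draw the requested number from each
    idx_by_cat <- split(seq_len(nrow(df)), df$Category)
    picked <- unlist(Map(function(idx, n) idx[sample.int(length(idx), n)],
                         idx_by_cat, n_per_cat[names(idx_by_cat)]))
    res <- df[picked, ]
    table(res$Category)
    ```

    Indexing with `idx[sample.int(length(idx), n)]` rather than `sample(idx, n)` avoids the surprise where `sample()` treats a single number as `1:x` when a category has only one row.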

  • 2021-01-22 17:19

    Best guess in absence of test cases:

      do.call( rbind, lapply( split(dfrm, dfrm$cat) ,
                             function(df) df[sample(nrow(df), 5) , ] )
              )
    

    Tested with Jonathan's data:

    > do.call( rbind, lapply( split(df, df$Category) ,
    +                          function(df) df[sample(nrow(df), 5) , ] )
    +           )
    
          Category Name      Value   
    1.8          1    8 -0.2496109   #  useful side-effect of labeling source group
    1.15         1   15 -0.4037368
    1.17         1   17 -0.4223724
    1.12         1   12 -0.9359026
    1.18         1   18  0.3741184
    2.37         2   37  0.3033610
    2.34         2   34 -0.4517738
    2.36         2   36 -0.7695923
    snipped remainder
    
  • 2021-01-22 17:39

    In the past, I've used a little wrapper I wrote for some of the functions from the "sampling" package.

    Here's the function:

    strata.sampling <- function(data, group, size, method = NULL) {
      #  USE: 
      #   * Specify a data.frame and grouping variable.
      #   * Decide on your sample size. For a sample proportional to the 
      #     population, enter "size" as a decimal. For an equal number of 
      #     samples from each group, enter "size" as a whole number. For
      #     a specific number of samples from each group, enter the numbers
      #     required as a vector.
    
      require(sampling)
      if (is.null(method)) method <- "srswor"
      if (!method %in% c("srswor", "srswr")) 
        stop('method must be "srswor" or "srswr"')
      temp <- data[order(data[[group]]), ]
      n.groups <- length(table(temp[group]))
      # A single value is either a proportion (< 1) or a per-group count
      if (length(size) == 1) {
        size <- if (size < 1) {
          round(table(temp[group]) * size)
        } else {
          rep(size, times = n.groups)
        }
      }
      strat <- strata(temp, stratanames = names(temp[group]), 
                      size = size, method = method)
      getdata(temp, strat)
    }
    

    Here's how you can use it:

    # Sample data --- Note each category has a different number of observations
    df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)), 
                     Name = 1:100, Value = rnorm(100))
    
    # Sample 5 from each "Category" group
    strata.sampling(df, "Category", 5)
    # Sample 2 from the first category, 3 from the next, and so on
    strata.sampling(df, "Category", c(2, 3, 4, 5, 2))
    # Sample 15% from each group
    strata.sampling(df, "Category", .15)
    

    There is also an enhanced function I wrote here. That function gracefully handles cases where a group might have fewer observations than the specified number of samples, and also lets you stratify by multiple variables. See the docs for several examples.
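
    The linked function is not reproduced here. As a rough base R sketch of the first idea only (capping the sample size at the group size), using a hypothetical helper name `sample_up_to` that is neither part of the "sampling" package nor the enhanced function mentioned above:

    ```r
    # Hypothetical helper: sample up to n rows per group without replacement,
    # taking the whole group when it has fewer than n rows
    sample_up_to <- function(data, group, n) {
      pieces <- lapply(split(data, data[[group]]), function(d) {
        k <- min(n, nrow(d))
        d[sample.int(nrow(d), k), , drop = FALSE]
      })
      do.call(rbind, pieces)
    }

    # Same unbalanced sample data as above; category 3 has only 7 rows
    df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)),
                     Name = 1:100, Value = rnorm(100))
    out <- sample_up_to(df, "Category", 10)
    table(out$Category)
    ```

    With this data, asking for 10 rows per category returns all 7 rows of category 3 and 10 rows from each of the others.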
