Selecting random rows by category from a data frame?

一个人的身影 2021-01-22 17:06

I have a data frame as follows:

Category Name Value

How would I select, say, 5 random names per category? Using sample returns random ro

3 answers

鱼传尺愫 (OP)

2021-01-22 17:39

In the past, I've used a little wrapper I wrote for some of the functions from the "sampling" package.

Here's the function:

strata.sampling <- function(data, group, size, method = NULL) {
  #  USE: 
  #   * Specify a data.frame and grouping variable.
  #   * Decide on your sample size. For a sample proportional to the 
  #     population, enter "size" as a decimal. For an equal number of 
  #     samples from each group, enter "size" as a whole number. For
  #     a specific number of samples from each group, enter the numbers
  #     required as a vector.

  require(sampling)
  if (is.null(method)) method <- "srswor"
  if (!method %in% c("srswor", "srswr")) 
    stop('method must be "srswor" or "srswr"')
  temp <- data[order(data[[group]]), ]
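  # Resolve "size" into one sample size per group. A vector of length > 1
  # is taken as-is; a single value is either a proportion (< 1) or a fixed
  # count applied to every group.
  group.counts <- table(temp[[group]])
  if (length(size) == 1) {
    if (size < 1) {
      size <- as.vector(round(group.counts * size))
    } else {
      size <- rep(size, times = length(group.counts))
    }
  }
  strat <- strata(temp, stratanames = names(temp[group]), 
                  size = size, method = method)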
  getdata(temp, strat)
}

Here's how you can use it:

# Sample data --- Note each category has a different number of observations
df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)), 
                 Name = 1:100, Value = rnorm(100))

# Sample 5 from each "Category" group
strata.sampling(df, "Category", 5)
# Sample 2 from the first category, 3 from the next, and so on
strata.sampling(df, "Category", c(2, 3, 4, 5, 2))
# Sample 15% from each group
strata.sampling(df, "Category", .15)
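
To see what each of the three calls above actually requests per group, the size handling can be pulled out and run on its own in base R. This is only a sketch: `resolve_size` is a hypothetical helper name that mirrors the logic inside `strata.sampling`, and it is not part of the "sampling" package.

```r
# Hypothetical helper: mirrors how strata.sampling() turns "size"
# into one sample size per group. Not part of the "sampling" package.
resolve_size <- function(groups, size) {
  counts <- table(groups)
  if (length(size) > 1) return(size)      # explicit per-group sizes
  if (size < 1) {
    # Proportion of each group, rounded to a whole number of rows
    return(as.vector(round(counts * size)))
  }
  rep(size, times = length(counts))       # same count for every group
}

# Same group layout as the sample data above
groups <- rep(1:5, times = c(40, 15, 7, 13, 25))

resolve_size(groups, 5)                 # 5 5 5 5 5
resolve_size(groups, c(2, 3, 4, 5, 2))  # 2 3 4 5 2
resolve_size(groups, .15)               # 6 2 1 2 4
```

Note that with a proportion, small groups can round down to very few rows (or to zero for a small enough fraction), so it is worth checking the resolved sizes before sampling.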

There is also an enhanced function I wrote here. That function gracefully handles cases where a group might have fewer observations than the specified number of samples, and also lets you stratify by multiple variables. See the docs for several examples.
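
For readers who just want the "fewer observations than requested" behaviour without any extra package, here is a minimal base R sketch. It is not the enhanced function mentioned above; `sample_per_group` is a hypothetical name, and it assumes the sensible fallback is to return every row of a group that is smaller than the requested sample.

```r
# Hypothetical base R helper (not the enhanced function referred to above).
# Samples up to n rows per group without replacement; groups with fewer
# than n rows are returned in full.
sample_per_group <- function(data, group, n) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  picked <- lapply(rows_by_group, function(rows) {
    k <- min(n, length(rows))
    # sample.int() on the positions avoids the sample(x) pitfall
    # when a group contains a single row
    rows[sample.int(length(rows), k)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

df <- data.frame(Category = rep(1:5, times = c(40, 15, 7, 13, 25)),
                 Name = 1:100, Value = rnorm(100))

# Ask for 10 per category; category 3 only has 7 rows, so all 7 come back
out <- sample_per_group(df, "Category", 10)
table(out$Category)
```

Capping at the group size is one design choice among several; stopping with an error or sampling with replacement would be equally defensible depending on the analysis.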
