Draw a random sample without replacement based on a strict range in R

问题

I'm trying to draw a random sample of rows without replacement from a dataset such that the sum of a column in the sample should be strictly within a range. For the example dataset mtcars, the random sample should be such that the sum of mpg is strictly within 90-100.

A reproducible example:

data("mtcars")

random_sample <- function(dataset){
  final_mpg = 0
  while (final_mpg < 100) {
    basic_dat <- dataset %>%
      sample_n(1) %>%
      ungroup()
    total_mpg <- basic_dat %>%
      summarise(mpg = sum(mpg)) %>%
      pull(mpg)
    final_mpg <- final_mpg + total_mpg
    if (final_mpg > 90 & final_mpg < 100){
      break()
    }
    final_dat <- rbind(get0("final_dat"), get0("basic_dat"))
  }
  return(final_dat)
}

chosen_sample <- random_sample(mtcars)

But this function output samples with sum(mpg) > 100. How do I ensure that every sample it generates is strictly within that range? Any help is much appreciated.

回答1:

Here's a hack, though realize that there's never a guarantee that it'll work.

#' Random sampling of data
#'
#' Return a sample of the dataset's rows where the sum of 'fld' values
#' is between the two numbers of 'sumbetween'.
#'
#' @param dat data.frame
#' @param fld character, the name of one of the fields in 'dat'
#' @param sumbetween numeric, length 2, the two ends of the range of
#'   desired sum
#' @param suggestn integer, a suggestion for 'n' around which sample
#'   sizes are based; the actual samples attempted will vary between
#'   0.5 and 1.5 times this value; if 'NA' (the default), then it
#'   defaults naively to 'mean(sumbetween) / median(dat[[fld]])'
#' @param iters integer, number of samples to attempt before
#'   "giving up" (otherwise this might run forever)
#' @return data.frame, a sample of the original dataset; regardless of
#'   success, two attributes are included, 'mu' and 'sigma',
#'   indicating the mean and standard deviation of the samples tested
random_sample <- function(dat, fld, sumbetween, suggestn = NA, iters = 100) {
  stopifnot(fld %in% names(dat), is.numeric(dat[[fld]]), is.numeric(sumbetween))

  if (is.na(suggestn)) {
    suggestn <- mean(sumbetween) / median(dat[[fld]])
  }
  suggestn <- min(suggestn, nrow(dat))

  mu <- NA
  Sn <- 0
  ind <- FALSE
  n <- 0L

  while ((is.na(iters) || n < iters) && !ind) {
    n <- n + 1L
    size <- min(nrow(dat), sample(seq(max(1, floor(suggestn/2)), ceiling(suggestn*1.5)), size = 1))
    rows <- sample(nrow(dat), size = size)
    s <- sum(dat[[fld]][rows])
    ind <- sumbetween[1] <= s & s <= sumbetween[2]
    # incremental mean and almost-variance of the samples
    # http://datagenetics.com/blog/november22017/index.html
    lastmu <- mu
    mu <- sum(s, (n-1)*mu, na.rm = TRUE)/n
    Sn <- Sn + sum(s, -lastmu, na.rm = TRUE)*sum(s, -mu, na.rm = TRUE)
  }

  out <- if (ind) dat[rows,] else NA
  if (!ind) warning("unable to find a successful sample after ", n, " iterations")
  # actual mean and variance of samples, successful or not
  attr(out, "mu") <- mu
  attr(out, "sigma") <- sqrt(Sn / n)
  return(out)
}

And its use is below. I use str here to demonstrate one feature: the addition of the all tested samples' means and deviations as attributes. If success, the attributes are not shown (print.data.frame by default shows no attributes), but if it fails then a warning will be given, and NA returned with the same attributes.

set.seed(42)
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# Warning in random_sample(mtcars, "mpg", c(90, 100), iters = 20) :
#   unable to find a successful sample after 20 iterations
#  logi NA
#  - attr(*, "mu")= num 106
#  - attr(*, "sigma")= num 37.9
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# 'data.frame': 5 obs. of  12 variables:
#  $ mpg : num  33.9 14.3 14.7 18.1 17.3
#  $ cyl : num  4 8 8 6 8
#  $ disp: num  71.1 360 440 225 275.8
#  $ hp  : num  65 245 230 105 180
#  $ drat: num  4.22 3.21 3.23 2.76 3.07
#  $ wt  : num  1.83 3.57 5.34 3.46 3.73
#  $ qsec: num  19.9 15.8 17.4 20.2 17.6
#  $ vs  : num  1 0 0 1 0
#  $ am  : num  1 0 0 0 0
#  $ gear: num  4 3 3 3 3
#  $ carb: num  1 4 4 1 3
#  $ new1: num  75.1 368 448 231 283.8
#  - attr(*, "mu")= num 96.1
#  - attr(*, "sigma")= num 42.1

The intent of the returns mean/deviation is to help the user determine if the suggestn (suggestion for a starting sample size) is mis-placed, or if iters is just too small and we quit too early (such as when the intended range is well within mu +/- sigma).

This uses iters to prevent an infinite loop. You can disable it (off to the races!) at your own peril.

This makes no promises that a feasible solution can be found. Imagine all values are multiples of 20, and the desired range is only 10 wide. There are certainly other conditions that are heuristically difficult to "know" with certainty to know if a solution exists.

回答2:

This is working. because of the values of mpg, it couldn't get more than 90.

ransmpl <- function(df) { 
  s1<- df[sample(rownames(df),1),] 
  s11 <- sum(s1$mpg) 
  while(s11<100){
    rn2<- rownames(df[!(rownames(df) %in% rownames(s1)),]) 
    nr<- df[sample(rn2,1),] 
    s11 <- sum(rbind(s1,nr)$mpg) 
    if(s11>100){ 
      break() 
    } 
    s1<-rbind(s1,nr) 
  } 
  return(s1) 
  }


chosen_sample <- ransmpl(mtcars)
chosen_sample

Output

> chosen_sample
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4

> sum(chosen_sample$mpg)
[1] 95.1

来源：https://stackoverflow.com/questions/61298625/draw-a-random-sample-without-replacement-based-on-a-strict-range-in-r

标签

random

dplyr