Question
I'm trying to draw a random sample of rows, without replacement, from a dataset such that the sum of one column in the sample is strictly within a given range. For the example dataset mtcars, the sum of mpg in the sample should be strictly between 90 and 100.
A reproducible example:
library(dplyr)
data("mtcars")
random_sample <- function(dataset){
  final_mpg = 0
  while (final_mpg < 100) {
    basic_dat <- dataset %>%
      sample_n(1) %>%
      ungroup()
    total_mpg <- basic_dat %>%
      summarise(mpg = sum(mpg)) %>%
      pull(mpg)
    final_mpg <- final_mpg + total_mpg
    if (final_mpg > 90 & final_mpg < 100){
      break()
    }
    final_dat <- rbind(get0("final_dat"), get0("basic_dat"))
  }
  return(final_dat)
}
chosen_sample <- random_sample(mtcars)
But this function outputs samples with sum(mpg) > 100. How do I ensure that every sample it generates is strictly within that range? Any help is much appreciated.
Answer 1:
Here's a hack, though realize that there's never a guarantee that it'll work.
#' Random sampling of data
#'
#' Return a sample of the dataset's rows where the sum of 'fld' values
#' is between the two numbers of 'sumbetween'.
#'
#' @param dat data.frame
#' @param fld character, the name of one of the fields in 'dat'
#' @param sumbetween numeric, length 2, the two ends of the range of
#'   desired sum
#' @param suggestn integer, a suggestion for 'n' around which sample
#'   sizes are based; the actual samples attempted will vary between
#'   0.5 and 1.5 times this value; if 'NA' (the default), then it
#'   defaults naively to 'mean(sumbetween) / median(dat[[fld]])'
#' @param iters integer, number of samples to attempt before
#'   "giving up" (otherwise this might run forever)
#' @return data.frame, a sample of the original dataset; regardless of
#'   success, two attributes are included, 'mu' and 'sigma',
#'   indicating the mean and standard deviation of the samples tested
random_sample <- function(dat, fld, sumbetween, suggestn = NA, iters = 100) {
  stopifnot(fld %in% names(dat), is.numeric(dat[[fld]]), is.numeric(sumbetween))
  if (is.na(suggestn)) {
    suggestn <- mean(sumbetween) / median(dat[[fld]])
  }
  suggestn <- min(suggestn, nrow(dat))
  mu <- NA
  Sn <- 0
  ind <- FALSE
  n <- 0L
  while ((is.na(iters) || n < iters) && !ind) {
    n <- n + 1L
    size <- min(nrow(dat), sample(seq(max(1, floor(suggestn/2)), ceiling(suggestn*1.5)), size = 1))
    rows <- sample(nrow(dat), size = size)
    s <- sum(dat[[fld]][rows])
    ind <- sumbetween[1] <= s & s <= sumbetween[2]
    # incremental mean and almost-variance of the samples
    # http://datagenetics.com/blog/november22017/index.html
    lastmu <- mu
    mu <- sum(s, (n-1)*mu, na.rm = TRUE)/n
    Sn <- Sn + sum(s, -lastmu, na.rm = TRUE)*sum(s, -mu, na.rm = TRUE)
  }
  out <- if (ind) dat[rows,] else NA
  if (!ind) warning("unable to find a successful sample after ", n, " iterations")
  # actual mean and variance of samples, successful or not
  attr(out, "mu") <- mu
  attr(out, "sigma") <- sqrt(Sn / n)
  return(out)
}
Its use is shown below. I use str here to demonstrate one feature: the mean and standard deviation of all tested samples' sums are attached as attributes. On success the attributes are not shown by default (print.data.frame does not print attributes); on failure a warning is issued and NA is returned with the same attributes.
set.seed(42)
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# Warning in random_sample(mtcars, "mpg", c(90, 100), iters = 20) :
# unable to find a successful sample after 20 iterations
# logi NA
# - attr(*, "mu")= num 106
# - attr(*, "sigma")= num 37.9
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# 'data.frame': 5 obs. of 12 variables:
# $ mpg : num 33.9 14.3 14.7 18.1 17.3
# $ cyl : num 4 8 8 6 8
# $ disp: num 71.1 360 440 225 275.8
# $ hp : num 65 245 230 105 180
# $ drat: num 4.22 3.21 3.23 2.76 3.07
# $ wt : num 1.83 3.57 5.34 3.46 3.73
# $ qsec: num 19.9 15.8 17.4 20.2 17.6
# $ vs : num 1 0 0 1 0
# $ am : num 1 0 0 0 0
# $ gear: num 4 3 3 3 3
# $ carb: num 1 4 4 1 3
# $ new1: num 75.1 368 448 231 283.8
# - attr(*, "mu")= num 96.1
# - attr(*, "sigma")= num 42.1
The intent of the returned mean/deviation is to help the user determine whether suggestn (the suggested starting sample size) is misplaced, or whether iters is just too small and we quit too early (such as when the intended range is well within mu +/- sigma).
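To make the mu/sigma bookkeeping easier to follow, here is a minimal standalone sketch of the same incremental update used in the loop above. The function name running_stats and the input numbers are made up for illustration; they are not part of the answer's code.

```r
# Incremental (Welford-style) mean and sum of squared deviations,
# mirroring the update inside random_sample's loop.
running_stats <- function(x) {
  mu <- 0
  Sn <- 0
  for (n in seq_along(x)) {
    lastmu <- mu
    mu <- lastmu + (x[n] - lastmu) / n          # updated running mean
    Sn <- Sn + (x[n] - lastmu) * (x[n] - mu)    # running sum of squared deviations
  }
  # dividing by n (not n - 1) gives the population standard deviation,
  # which matches sqrt(Sn / n) in the function above
  list(mu = mu, sigma = sqrt(Sn / length(x)))
}

sums <- c(106, 88, 131, 97, 74)   # made-up sample sums, for illustration only
st <- running_stats(sums)
st$mu      # agrees with mean(sums)
st$sigma   # agrees with sqrt(mean((sums - mean(sums))^2))
```

Because sigma divides by n rather than n - 1, it will be slightly smaller than what sd() reports for the same values.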
This uses iters to prevent an infinite loop. You can disable it with iters = NA (off to the races!) at your own peril.
This makes no promises that a feasible solution can be found. Imagine all values are multiples of 20 and the desired range is only 10 wide, such as 45 to 55. There are certainly other conditions under which it is hard to know with certainty, at least heuristically, whether a solution exists.
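For small datasets, feasibility can be checked exactly before sampling. The following is a rough sketch, not part of the answer above: sum_is_reachable is a hypothetical helper that assumes non-negative values with at most `digits` decimal places and treats the bounds as inclusive (like the check in random_sample). It runs a simple subset-sum pass over scaled integer values.

```r
# Hypothetical helper: is there ANY subset of `values` whose sum lies in [lo, hi]?
# Assumes non-negative values with at most `digits` decimal places.
sum_is_reachable <- function(values, lo, hi, digits = 1) {
  scale <- 10^digits
  v <- as.integer(round(values * scale))
  lo_i <- max(0L, as.integer(ceiling(lo * scale - 1e-9)))
  hi_i <- as.integer(floor(hi * scale + 1e-9))
  if (hi_i < 0L || lo_i > hi_i) return(FALSE)
  # reachable[k + 1] is TRUE when some subset sums to exactly k (scaled units)
  reachable <- c(TRUE, rep(FALSE, hi_i))
  for (x in v) {
    if (x > hi_i) next
    # sums k already reachable with k + x <= hi_i; computed before updating,
    # so each value is used at most once
    idx <- which(reachable[seq_len(hi_i - x + 1L)])
    reachable[idx + x] <- TRUE
  }
  any(reachable[seq(lo_i, hi_i) + 1L])
}

sum_is_reachable(c(20, 40, 60, 80), 45, 55)   # FALSE: no subset sums into 45..55
sum_is_reachable(mtcars$mpg, 90, 100)         # TRUE: e.g. the 98.3 sample shown earlier
```

A TRUE result only says that some qualifying subset exists; it does not say how likely random sampling is to find one within iters attempts.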
Answer 2:
This works. Depending on which mpg values are drawn, though, the sum cannot always get above 90.
ransmpl <- function(df) {
  s1<- df[sample(rownames(df),1),]
  s11 <- sum(s1$mpg)
  while(s11<100){
    rn2<- rownames(df[!(rownames(df) %in% rownames(s1)),])
    nr<- df[sample(rn2,1),]
    s11 <- sum(rbind(s1,nr)$mpg)
    if(s11>100){
      break()
    }
    s1<-rbind(s1,nr)
  }
  return(s1)
}
chosen_sample <- ransmpl(mtcars)
chosen_sample
Output
> chosen_sample
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
> sum(chosen_sample$mpg)
[1] 95.1
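The function above stops before the sum exceeds 100 but does not enforce the lower bound of 90. One possible way to enforce both bounds is to redraw until a qualifying sample appears. This is only a sketch under stated assumptions, not the answerer's method: sample_sum_between and max_tries are hypothetical names, and the values in the column are assumed to be non-negative.

```r
# Hypothetical sketch: shuffle the rows (sampling without replacement),
# take running sums along that order, and keep the first prefix whose
# sum is strictly inside (lo, hi). If no prefix qualifies, reshuffle.
sample_sum_between <- function(df, fld, lo, hi, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    ord <- sample(nrow(df))               # random row order
    cs <- cumsum(df[[fld]][ord])          # running sums along that order
    hit <- which(cs > lo & cs < hi)       # prefixes strictly inside the range
    if (length(hit) > 0) {
      return(df[ord[seq_len(hit[1])], , drop = FALSE])
    }
  }
  stop("no sample found after ", max_tries, " tries")
}

set.seed(1)
chosen <- sample_sum_between(mtcars, "mpg", 90, 100)
sum(chosen$mpg)   # strictly between 90 and 100
```

The max_tries cap plays the same role as iters in the first answer: if no feasible sample exists, the function stops with an error instead of looping forever.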
Source: https://stackoverflow.com/questions/61298625/draw-a-random-sample-without-replacement-based-on-a-strict-range-in-r