I need to draw a stratified sample with n observations in each stratum, but some strata have fewer than n observations. If a stratum has too few observations (say, k < n), I want to sample all k observations from that stratum.
library(sampling)
n <- 10
geo_ID <- c(rep(1, times = 20), rep(2, times = 20), rep(c(1, 2, 3, 4), times = 5))
set.seed(42)
V1 <- rnorm(60, 0, 1)
V2 <- rnorm(60, 2, 1)
DF <- data.frame(geo_ID = geo_ID, V1 = V1, V2 = V2)
#Sort as explained in ?strata help file
DF <- DF[order(DF[, "geo_ID"]), ]
strata(DF, stratanames = "geo_ID", size = c(n, n, n, n), method = "srswor")
If I use sampling without replacement as above, I (understandably) get the error:
Error in strata(DF, stratanames = "geo_ID", size = c(10, 10, 10, 10), :
not enough obervations in the stratum
Sampling with replacement (method = "srswr") avoids the error, but that's not ideal: it can draw repeated observations even from strata that are large enough to supply n unique ones.
NOTE: There's a similar question on SO ("Stratified sampling - not enough observations"), but it wasn't really answered, and I think this question is more general. The answers to the linked question are not generally useful since they require either (i) sample sizes proportional to the stratum size (whereas I need a fixed number) or (ii) manually programming stratum by stratum, which isn't feasible as the number of strata increases.
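One way the requirement above might be met without leaving the "sampling" package is to cap the requested size at each stratum's row count before calling strata(). The sketch below is an assumption rather than a documented recipe: it relies on the data being sorted by geo_ID so that the order of table() matches the order in which strata() expects its sizes, and the name capped_size is only illustrative.

```r
n <- 10
geo_ID <- c(rep(1, times = 20), rep(2, times = 20), rep(c(1, 2, 3, 4), times = 5))
set.seed(42)
DF <- data.frame(geo_ID = geo_ID, V1 = rnorm(60, 0, 1), V2 = rnorm(60, 2, 1))
DF <- DF[order(DF$geo_ID), ]

# Rows available in each stratum, in sorted stratum order
stratum_counts <- table(DF$geo_ID)

# Request n rows, or every row when the stratum holds fewer than n
capped_size <- as.vector(pmin(n, stratum_counts))

# The strata() call only runs if the "sampling" package is installed
if (requireNamespace("sampling", quietly = TRUE)) {
  s <- sampling::strata(DF, stratanames = "geo_ID",
                        size = capped_size, method = "srswor")
  sampled <- sampling::getdata(DF, s)
}
```

For this data the strata hold 25, 25, 5 and 5 rows, so the capped sizes come out as 10, 10, 5 and 5.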
This doesn't answer your question about how to do this with the "sampling" package, but I've written a function called stratified that will do this for you.
If you have "devtools" installed, you can load it like this:
library(devtools)
source_gist(6424112)
Otherwise, just copy the code of the function from the Gist into your session and have fun.
Usage is simple:
set.seed(1) ## So you can reproduce this
stratified(DF, group = "geo_ID", size = 10)
# Some groups
# ---3, 4---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
# geo_ID V1 V2
# 7 1 1.51152200 2.3358481
# 9 1 2.01842371 2.9207286
# 14 1 -0.27878877 1.0464766
# 20 1 1.32011335 0.9002191
# 5 1 0.40426832 1.2727079
# :::SNIP:::
# 43 3 0.75816324 0.9967914
# 47 3 -0.81139318 1.5777441
# 55 3 0.08976065 0.3389009
# 51 3 0.32192527 1.9749074
# 48 4 1.44410126 1.8776498
# 44 4 -0.72670483 3.8484819
# 60 4 0.28488295 2.1372562
# 52 4 -0.78383894 2.1080727
# 56 4 0.27655075 1.6176663
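For readers who cannot load the Gist, the behaviour shown above (n rows per group, or the whole group when it is smaller) can be approximated in base R. This is a minimal sketch and not the code of stratified; sample_capped is a hypothetical name, and it skips the warning message and the extra features.

```r
# Hypothetical helper, NOT the Gist's `stratified` implementation
sample_capped <- function(df, group, size) {
  # Row positions of df, split by the value of the grouping column
  idx <- split(seq_len(nrow(df)), df[[group]])
  keep <- lapply(idx, function(i) {
    # Small group: keep every row. Otherwise draw `size` rows without replacement.
    # sample.int() on the length avoids sample()'s surprise with length-1 vectors.
    if (length(i) <= size) i else i[sample.int(length(i), size)]
  })
  df[unlist(keep, use.names = FALSE), , drop = FALSE]
}

geo_ID <- c(rep(1, times = 20), rep(2, times = 20), rep(c(1, 2, 3, 4), times = 5))
set.seed(42)
DF <- data.frame(geo_ID = geo_ID, V1 = rnorm(60, 0, 1), V2 = rnorm(60, 2, 1))

set.seed(1)
out <- sample_capped(DF, group = "geo_ID", size = 10)
table(out$geo_ID)  # 10 rows each for groups 1 and 2, all 5 rows for groups 3 and 4
```

Because the function works on row positions per group, it does not need the data to be sorted first, and the number of strata does not matter.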
There are some "fun" features, like subsetting your strata in the function itself:
## Selects only "geo_ID" values equal to 1 or 4
stratified(DF, group = "geo_ID", size = 10, select = list(geo_ID = c(1, 4)))
... taking a proportionate sample:
## Just set the size argument to a value less than 1
stratified(DF, group = "geo_ID", size = .1)
... and using multiple columns as your groups. The comments at the Gist include some examples to try out.
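The proportionate case mentioned above can be sketched the same way in base R. How stratified rounds fractional group sizes is not stated here, so the ceiling() used below is an assumption chosen so that every group contributes at least one row; sample_fraction is a hypothetical name.

```r
# Hypothetical helper: draw a fixed fraction of rows from every group
sample_fraction <- function(df, group, frac) {
  stopifnot(frac > 0, frac < 1)
  idx <- split(seq_len(nrow(df)), df[[group]])
  keep <- lapply(idx, function(i) {
    # Assumption: round up, so small groups are never dropped entirely
    k <- ceiling(frac * length(i))
    i[sample.int(length(i), k)]
  })
  df[unlist(keep, use.names = FALSE), , drop = FALSE]
}

geo_ID <- c(rep(1, times = 20), rep(2, times = 20), rep(c(1, 2, 3, 4), times = 5))
set.seed(42)
DF <- data.frame(geo_ID = geo_ID, V1 = rnorm(60, 0, 1), V2 = rnorm(60, 2, 1))

set.seed(1)
prop <- sample_fraction(DF, group = "geo_ID", frac = 0.1)
table(prop$geo_ID)  # 3 rows each for groups 1 and 2, 1 row each for groups 3 and 4
```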
Source: https://stackoverflow.com/questions/23270152/stratified-sample-when-some-strata-are-too-small