How can I create a stratified sample in R using the "sampling" package? My dataset has 355,000 observations. The code works fine up to the last line. Below is the code I w
I don't know the strata function, but a bit of coding might do what you want:
# Example data: 10 strata with 35,000 rows each
d <- expand.grid(id = 1:35000, stratum = letters[1:10])
p <- 0.1  # sampling fraction per stratum
dsample <- data.frame()
system.time(
  for (i in levels(d$stratum)) {
    dsub <- subset(d, stratum == i)
    B <- ceiling(nrow(dsub) * p)  # number of rows to draw from this stratum
    dsub <- dsub[sample(seq_len(nrow(dsub)), B), ]
    dsample <- rbind(dsample, dsub)
  }
)
# size per stratum in resulting df is 10 % of original size:
table(dsample$stratum)
HTH, Kay
PS: CPU time on my relic of a laptop is 0.09 seconds!
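The loop above grows dsample with rbind() on every pass, which is fine for ten strata but gets slower as the number of strata grows. A possible alternative, sketched here in base R only, splits the row indices by stratum and draws from each group in one go. The seed value and the names rows_by_stratum and picked are assumptions made for this illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

d <- expand.grid(id = 1:35000, stratum = letters[1:10])
p <- 0.1  # sampling fraction per stratum

# Split the row indices by stratum, then draw ceiling(n * p) of them per group.
# sample.int() on the group length avoids the sample(x, k) pitfall that occurs
# when a group contains a single row index.
rows_by_stratum <- split(seq_len(nrow(d)), d$stratum)
picked <- unlist(lapply(rows_by_stratum, function(idx) {
  idx[sample.int(length(idx), ceiling(length(idx) * p))]
}), use.names = FALSE)

dsample <- d[picked, ]
table(dsample$stratum)
```

Because only integer indices are moved around until the final subset, the data frame is copied once rather than once per stratum.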
I had to do something similar last year. If this is something you do a lot, you might want to use a function like the one below. It lets you specify the data frame you're sampling from, which variable holds the strata, and how many observations to draw from each stratum; call set.seed() beforehand if you want a reproducible sample. You can save the function as something like "stratified.R" and load it when you need it. See http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/
stratified = function(df, group, size) {
# USE: * Specify your data frame and grouping variable (as column
# number) as the first two arguments.
# * Decide on your sample size. For a sample proportional to the
# population, enter "size" as a decimal. For an equal number
# of samples from each group, enter "size" as a whole number.
#
# Example 1: Sample 10% of each group from a data frame named "z",
# where the grouping variable is the fourth variable, use:
#
# > stratified(z, 4, .1)
#
# Example 2: Sample 5 observations from each group from a data frame
# named "z"; grouping variable is the third variable:
#
# > stratified(z, 3, 5)
#
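  # Fail early with a clear error if the sampling package is not installed.
  stopifnot(requireNamespace("sampling", quietly = TRUE))
  # Sort by the grouping column so each stratum's rows are contiguous and
  # appear in the same order as the counts returned by table().
  temp <- df[order(df[[group]]), ]
  counts <- table(temp[[group]])
  counts <- counts[counts > 0]  # ignore unused factor levels
  if (size < 1) {
    size <- ceiling(counts * size)             # proportional allocation
  } else {
    size <- rep(size, times = length(counts))  # same count in every stratum
  }
  strat <- sampling::strata(temp, stratanames = names(temp[group]),
                            size = as.vector(size), method = "srswor")
  sampling::getdata(temp, strat)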
}
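If the sampling package is not available, roughly the same interface can be sketched in base R. This is only an illustration: stratified_base is a hypothetical helper, not part of any package, and unlike the function above it caps a whole-number size at the group size instead of failing when a group is too small. The example data frame z and the seed are likewise assumptions.

```r
# Hypothetical base R counterpart of stratified(); not part of the sampling package.
stratified_base <- function(df, group, size) {
  # Row indices per stratum; drop = TRUE skips unused factor levels.
  rows_by_group <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  picked <- lapply(rows_by_group, function(idx) {
    # A size below 1 is a proportion; otherwise a fixed count, capped at group size.
    n <- if (size < 1) ceiling(length(idx) * size) else min(size, length(idx))
    idx[sample.int(length(idx), n)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)  # assumed seed, so repeated runs give the same sample
z <- data.frame(id = 1:60, grp = rep(c("a", "b", "c"), times = c(10, 20, 30)))

table(stratified_base(z, 2, 0.5)$grp)  # proportional: half of each group
table(stratified_base(z, 2, 5)$grp)    # fixed: 5 rows from each group
```

As with the original, the grouping variable can be given as a column number; because the helper indexes with [[ ]], a column name such as "grp" works as well.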