How to create a stratified sample by state in R

长情又很酷 2021-01-01 03:27

How can I create a stratified sample in R using the "sampling" package? My dataset has 355,000 observations. The code works fine up to the last line. Below is the code I w

2 answers
  • 2021-01-01 03:49

    I don't know the strata function, but a bit of coding might do what you want:

    # Toy data: 10 strata (a-j) with 35,000 rows each
    d <- expand.grid(id = 1:35000, stratum = letters[1:10])
    
    p <- 0.1  # sampling fraction per stratum
    
    dsample <- data.frame()
    
    system.time(
      for (i in levels(d$stratum)) {
        dsub <- subset(d, stratum == i)
        B <- ceiling(nrow(dsub) * p)  # rows to draw from this stratum
        dsub <- dsub[sample(seq_len(nrow(dsub)), B), ]
        dsample <- rbind(dsample, dsub)
      }
    )
    
    # size per stratum in the resulting data frame is 10% of the original size:
    table(dsample$stratum)
    

    HTH, Kay

    PS: CPU time on my ancient laptop is 0.09 seconds!
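
    The loop above grows the result with rbind() on every pass, which is fine for ten strata but can get slow with many. As a rough sketch of the same idea without the loop, one option is to split the row indices by stratum and draw from each group in one go. The names rows_by_stratum and keep are made up for this illustration, and set.seed(42) is only there to make the draw repeatable:

    ```r
    set.seed(42)  # assumption: fixed seed so the sample is reproducible

    # Same toy data as above: 10 strata with 35,000 rows each
    d <- expand.grid(id = 1:35000, stratum = letters[1:10])
    p <- 0.1  # sampling fraction per stratum

    # Row numbers of d, grouped by stratum
    rows_by_stratum <- split(seq_len(nrow(d)), d$stratum)

    # Draw ceiling(n * p) row numbers from each stratum, without replacement
    keep <- unlist(lapply(rows_by_stratum, function(idx) {
      idx[sample.int(length(idx), ceiling(length(idx) * p))]
    }), use.names = FALSE)

    dsample <- d[keep, ]
    table(dsample$stratum)
    ```

    Indexing with idx[sample.int(...)] rather than sample(idx, ...) avoids the surprise that sample() treats a single number n as the range 1:n, which would matter for a stratum with only one row.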

  • 2021-01-01 04:02

    I had to do something similar last year. If this is something you do often, you might want to use a function like the one below. It takes the data frame you're sampling from, the column that holds the strata, and the sample size; call set.seed() beforehand if you need a reproducible sample. You can save the function as something like "stratified.R" and source it when you need it. See http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

    stratified = function(df, group, size) {
      #  USE: * Specify your data frame and grouping variable (as column 
      #         number) as the first two arguments.
      #       * Decide on your sample size. For a sample proportional to the
      #         population, enter "size" as a decimal. For an equal number 
      #         of samples from each group, enter "size" as a whole number.
      #
      #  Example 1: Sample 10% of each group from a data frame named "z",
      #             where the grouping variable is the fourth variable, use:
      # 
      #                 > stratified(z, 4, .1)
      #
      #  Example 2: Sample 5 observations from each group from a data frame
      #             named "z"; grouping variable is the third variable:
      #
      #                 > stratified(z, 3, 5)
      #
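      require(sampling)
      # Sort by the grouping variable so that "size" lines up with the
      # order in which the strata appear in the data
      temp = df[order(df[[group]]), ]
      n_per_group = table(temp[[group]])
      if (size < 1) {
        # proportional allocation, rounded up within each group
        size = as.vector(ceiling(n_per_group * size))
      } else {
        # the same fixed count for every group
        size = rep(size, times = length(n_per_group))
      }
      strat = strata(temp, stratanames = names(temp[group]),
                     size = size, method = "srswor")
      getdata(temp, strat)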
    }
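
    To make the two meanings of "size" concrete, here is a minimal base-R sketch of the same interface that does not need the "sampling" package. The function name stratified_base, the toy data frame z, and its state column are hypothetical and only for illustration. Unlike the function above, it caps a fixed count at the group size instead of failing when a group is too small, which is a design choice of this sketch:

    ```r
    # Hypothetical base-R counterpart of stratified(); not part of any package
    stratified_base <- function(df, group, size) {
      # Row numbers of df, grouped by the values of the grouping column
      rows_by_stratum <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
      keep <- unlist(lapply(rows_by_stratum, function(idx) {
        # size < 1: proportion of the group, rounded up;
        # otherwise a fixed count, capped at the group size (assumption)
        n <- if (size < 1) ceiling(length(idx) * size) else min(size, length(idx))
        idx[sample.int(length(idx), n)]
      }), use.names = FALSE)
      # Return the sampled rows in their original order
      df[sort(keep), , drop = FALSE]
    }

    set.seed(1)  # assumption: fixed seed for a repeatable draw
    z <- data.frame(id = 1:60,
                    state = rep(c("CA", "NY", "TX"), times = c(30, 20, 10)))

    table(stratified_base(z, "state", 0.1)$state)  # 10% of each state, rounded up
    table(stratified_base(z, 2, 5)$state)          # 5 rows from each state
    ```

    The grouping column can be given by name or by position, since both work with df[[group]].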
    