Algorithm for sampling without replacement?

情歌与酒 2020-12-02 13:56

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations

6 Answers
  • 2020-12-02 14:14

    Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumerical Algorithms.

    void SampleWithoutReplacement
    (
        int populationSize,    // size of set sampling from
        int sampleSize,        // size of each sample
        vector<int> & samples  // output, zero-offset indices to selected items
    )
    {
        // Use Knuth's variable names
        int& n = sampleSize;
        int& N = populationSize;
    
        int t = 0; // total input records dealt with
        int m = 0; // number of items selected so far
        double u;
    
        while (m < n)
        {
            u = GetUniform(); // call a uniform(0,1) random number generator
    
            if ( (N - t)*u >= n - m )
            {
                t++;
            }
            else
            {
                samples[m] = t;
                t++; m++;
            }
        }
    }
    

    There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

  • 2020-12-02 14:22

    See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.
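
    For reference, one common way to generate unique random indices (whether or not it is exactly what the linked answer does) is a partial Fisher-Yates shuffle: shuffle only as many positions as you need. A minimal C++ sketch of that idea, not code from the linked answer:

    #include <numeric>
    #include <random>
    #include <vector>

    // Partial Fisher-Yates shuffle: returns sampleSize distinct indices in
    // [0, populationSize). Uses O(populationSize) memory and O(sampleSize) swaps.
    std::vector<int> PartialShuffleSample(int populationSize, int sampleSize, std::mt19937& rng)
    {
        std::vector<int> pool(populationSize);
        std::iota(pool.begin(), pool.end(), 0);      // 0, 1, ..., populationSize-1
        for (int i = 0; i < sampleSize; ++i)
        {
            // move a random element from the not-yet-fixed tail into slot i
            std::uniform_int_distribution<int> pick(i, populationSize - 1);
            std::swap(pool[i], pool[pick(rng)]);
        }
        pool.resize(sampleSize);                     // the first sampleSize slots are the sample
        return pool;
    }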

  • 2020-12-02 14:23

    Working C++ code based on the answer by John D. Cook:

    #include <iostream>
    #include <random>
    #include <vector>
    
    double GetUniform()
    {
        static std::default_random_engine re;
        static std::uniform_real_distribution<double> Dist(0,1);
        return Dist(re);
    }
    
    // John D. Cook, https://stackoverflow.com/a/311716/15485
    void SampleWithoutReplacement
    (
        int populationSize,    // size of set sampling from
        int sampleSize,        // size of each sample
        std::vector<int> & samples  // output, zero-offset indices to selected items
    )
    {
        // Use Knuth's variable names
        int& n = sampleSize;
        int& N = populationSize;
    
        int t = 0; // total input records dealt with
        int m = 0; // number of items selected so far
        double u;
    
        while (m < n)
        {
            u = GetUniform(); // call a uniform(0,1) random number generator
    
            if ( (N - t)*u >= n - m )
            {
                t++;
            }
            else
            {
                samples[m] = t;
                t++; m++;
            }
        }
    }
    
    int main(int,char**)
    {
      const size_t sz = 10;
      std::vector< int > samples(sz);
      SampleWithoutReplacement(10*sz,sz,samples);
      for (size_t i = 0; i < sz; i++ ) {
        std::cout << samples[i] << "\t";
      }
    
      return 0;
    }
    
  • 2020-12-02 14:31

    Another algorithm for sampling without replacement is described on Rosetta Code.

    It is similar to the one described by John D. Cook in his answer, and also comes from Knuth, but it makes a different assumption: the population size is unknown, but the sample can fit in memory. This one is called "Knuth's algorithm S".

    Quoting the rosettacode article (a C++ sketch of the same idea follows the list):

    1. Select the first n items as the sample as they become available;
    2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
    3. Repeat #2 for any subsequent items.
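
    A minimal C++ sketch of this reservoir idea, as an illustration of the quoted steps rather than code from the rosettacode article; the stream is modelled as a std::vector for simplicity:

    #include <random>
    #include <vector>

    // Keep a uniform sample of n items from a stream whose total length need not
    // be known in advance (reservoir sampling, following the quoted steps).
    std::vector<int> ReservoirSample(const std::vector<int>& stream, int n, std::mt19937& rng)
    {
        std::vector<int> sample;
        sample.reserve(n);
        for (int i = 0; i < static_cast<int>(stream.size()); ++i)
        {
            if (i < n)
            {
                sample.push_back(stream[i]);      // step 1: take the first n items
            }
            else
            {
                // step 2: keep the (i+1)-th item with probability n/(i+1) ...
                std::uniform_int_distribution<int> pos(0, i);
                int j = pos(rng);
                if (j < n)
                    sample[j] = stream[i];        // ... replacing a uniformly chosen slot
            }
        }
        return sample;
    }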
  • 2020-12-02 14:38

    Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented it extensively, including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.

    import random  # rand() below assumes a reasonably recent Nim standard library

    iterator uniqueRandomValuesBelow*(N, M: int) =
      ## Returns a total of M unique random values i with 0 <= i < N
      ## These indices can be used to construct e.g. a random sample without replacement
      assert(M <= N)
    
      var t = 0 # total input records dealt with
      var m = 0 # number of items selected so far
    
      while (m < M):
        let u = rand(1.0) # call a uniform(0,1) random number generator

        # meaning of the following terms:
        # (N - t) is the total number of remaining draws left (initially just N)
        # (M - m) is how many of these remaining draws must be positive (initially just M)
        # => Probability for next draw = (M-m) / (N-t)
        #    i.e.: (required positive draws left) / (total draws left)
        #
        # This is implemented by the inequality expression below:
        # - the larger (M-m), the larger the probability of a positive draw
        # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
        # - for (N-t) >> (M-m), we must get a very small u
        #
        # example: (N-t) = 7, (M-m) = 5
        # => we draw the next with prob 5/7
        #    lets assume the draw fails
        # => t += 1 => (N-t) = 6
        # => we draw the next with prob 5/6
        #    lets assume the draw succeeds
        # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
        # => we draw the next with prob 4/5
        #    lets assume the draw fails
        # => t += 1 => (N-t) = 4
        # => we draw the next with prob 4/4, i.e.,
        #    we will draw with certainty from now on
        #    (in the next steps we get prob 3/3, 2/2, ...)
        if (N - t).toFloat * u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
          # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
          t += 1
        else:
          # draw t -- happens when (M-m) gets large and/or low u
          yield t # this is where we output an index, can be used to sample
          t += 1
          m += 1
    
    # example use
    for i in uniqueRandomValuesBelow(100, 5):
      echo i
    
  • When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.

    When I was a student I wrote some algorithms for uniform sampling without replacement with average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm in R:

    # The Tree growing algorithm for uniform sampling without replacement
    # by Pavel Ruzankin 
    quicksample = function (n,size)
    # n - the number of items to choose from
    # size - the sample size
    {
      s=as.integer(size)
      if (s>n) {
        stop("Sample size is greater than the number of items to choose from")
      }
      # upv=integer(s) #level up edge is pointing to
      leftv=integer(s) #left edge is pointing to; must be filled with zeros
      rightv=integer(s) #right edge is pointing to; must be filled with zeros
      samp=integer(s) #the sample
      ordn=integer(s) #relative ordinal number
    
      ordn[1L]=1L #initial value for the root vertex
      samp[1L]=sample(n,1L) 
      if (s > 1L) for (j in 2L:s) {
        curn=sample(n-j+1L,1L) #current number sampled
        curordn=0L #current ordinal number
        v=1L #current vertex
        from=1L #how we got here: 0 - by left edge, 1 - by right edge
        repeat {
          curordn=curordn+ordn[v]
          if (curn+curordn>samp[v]) { #going down by the right edge
            if (from == 0L) {
              ordn[v]=ordn[v]-1L
            }
            if (rightv[v]!=0L) {
              v=rightv[v]
              from=1L
            } else { #creating a new vertex
              samp[j]=curn+curordn
              ordn[j]=1L
              # upv[j]=v
              rightv[v]=j
              break
            }
          } else { #going down by the left edge
            if (from==1L) {
              ordn[v]=ordn[v]+1L
            }
            if (leftv[v]!=0L) {
              v=leftv[v]
              from=0L
            } else { #creating a new vertex
              samp[j]=curn+curordn-1L
              ordn[j]=-1L
              # upv[j]=v
              leftv[v]=j
              break
            }
          }
        }
      }
      return(samp)  
    }
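
    For example, calling quicksample(1000000, 10) should return a vector of 10 distinct integers drawn uniformly from 1 to 1000000.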
    

    The complexity of this algorithm is discussed in: Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

    If you find the algorithm useful, please cite this reference.

    See also: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. DOI: 10.1080/00207168408803438

    Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. International Journal of Computer Mathematics, 11(2): 127–140. DOI: 10.1080/00207168208803304

    In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
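
    As an aside (this is a different hash-set-based approach, not the algorithm from those papers): Robert Floyd's sampling algorithm also produces s distinct values with expected O(s) work and without touching an array of size n. A C++ sketch:

    #include <random>
    #include <unordered_set>
    #include <vector>

    // Robert Floyd's algorithm: sampleSize distinct indices in [0, populationSize),
    // using a hash set, with expected O(sampleSize) work.
    std::vector<int> FloydSample(int populationSize, int sampleSize, std::mt19937& rng)
    {
        std::unordered_set<int> chosen;
        std::vector<int> samples;
        samples.reserve(sampleSize);
        for (int j = populationSize - sampleSize; j < populationSize; ++j)
        {
            std::uniform_int_distribution<int> pick(0, j);   // inclusive range [0, j]
            int t = pick(rng);
            if (chosen.insert(t).second)
                samples.push_back(t);    // t not seen before: take t
            else
            {
                chosen.insert(j);        // t already taken: take j instead (j is always new here)
                samples.push_back(j);
            }
        }
        return samples;
    }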
