What is the big-O cost of naive random selection from a finite set?

终归单人心 2021-02-05 17:25

This question on getting random values from a finite set got me thinking...

It's fairly common for people to want to retrieve X unique values from a set of Y values.

8 answers
  •  闹比i (original poster)
    2021-02-05 18:12

    Before answering this question in detail, let's define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects and want to pick m distinct objects from it, such that the probability of a given object aj appearing in the result is the same for all objects.

    If you have already picked k items and randomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming the random samples are independent) geometrically distributed with parameter (n-k)/n. Thus the expected number of samples needed to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
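    The naive procedure described above can be sketched as follows. This is a minimal illustration, not code from the original question: the name `naive_sample` and the returned draw counter are assumptions added here so the number of attempts can be observed.

```python
import random

def naive_sample(items, m, rng=random):
    """Draw with replacement until m distinct items are collected.

    Hypothetical helper for illustration. Returns (chosen, draws), where
    draws counts every attempt, including rejected duplicates.
    """
    n = len(items)
    if m > n:
        raise ValueError("cannot pick more distinct items than the set holds")
    chosen = []
    seen = set()   # indices that have already been picked
    draws = 0
    while len(chosen) < m:
        idx = rng.randrange(n)   # uniform index in [0, n)
        draws += 1
        if idx not in seen:      # happens with probability (n-k)/n
            seen.add(idx)
            chosen.append(items[idx])
    return chosen, draws

picked, attempts = naive_sample(list(range(100)), 90)
print(len(picked), attempts)
```

    Running this a few times with m close to n should show the attempt count growing well beyond m, in line with the n/(n-k) term above.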

    In conclusion, if you need m unique objects selected at random, the expected total number of samples this algorithm takes is

    n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))

    which, as Alderath showed, can be estimated (bounded from above, since each of the m terms is at most n/(n-m+1)) by

    m*n / (n-m+1).

    You can see a little bit more from this formula:

    * The expected number of samples needed to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
    * You can expect really long computation times when m is close to n, especially if n is large.
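    To get a feel for how the exact sum compares with the estimate m*n / (n-m+1), the two can be evaluated side by side. This is only a sketch; the function names `expected_draws` and `upper_bound` are made up for this illustration, and it assumes 1 <= m <= n.

```python
from fractions import Fraction

def expected_draws(n, m):
    """Exact expected number of samples: sum of n/(n-k) for k = 0..m-1."""
    return sum(Fraction(n, n - k) for k in range(m))

def upper_bound(n, m):
    """The estimate m*n/(n-m+1); every term of the sum is at most n/(n-m+1)."""
    return Fraction(m * n, n - m + 1)

# Compare the exact expectation with the estimate as m approaches n.
for n, m in [(100, 10), (100, 50), (100, 90), (100, 100)]:
    print(n, m, float(expected_draws(n, m)), float(upper_bound(n, m)))
```

    For m = n the exact sum equals n times the n-th harmonic number, which grows like n ln n, whereas the estimate gives n^2. The estimate is therefore tight for small m and increasingly pessimistic as m approaches n.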

    In order to obtain m unique members from the set efficiently, use a variant of Donald Knuth's algorithm for obtaining a random permutation (the Fisher-Yates shuffle). Here, I'll assume that the n objects are stored in an array a.

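    for i = 1..m
      k = randInt(i, n)    // uniform random integer in {i, ..., n}
      exchange(i, k)       // swap a[i] and a[k]
    end
    // a[1..m] now holds the m selected objects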
    

    Here, randInt samples an integer uniformly from {i, i+1, ..., n}, and exchange swaps two members of the array. After the loop, the first m entries of the array are the selected objects. You only need m swaps, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to store only the entries for which a[i] <> i, which gives O(m) for both time and memory, but with higher constants).
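    A possible Python rendering of the partial shuffle, together with the sparse variant that stores only the moved entries, might look like this. Both function names are hypothetical, indices are 0-based rather than 1-based as in the pseudocode, and the first function copies its input for safety, which costs O(n) time on top of the m swaps; shuffle in place if that matters.

```python
import random

def partial_shuffle_sample(items, m, rng=random):
    """Pick m distinct items using m swaps (partial Fisher-Yates shuffle)."""
    a = list(items)               # copy so the caller's data is untouched
    n = len(a)
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    for i in range(m):
        k = rng.randrange(i, n)   # 0-based randInt: uniform in {i, ..., n-1}
        a[i], a[k] = a[k], a[i]   # exchange(i, k)
    return a[:m]

def sparse_sample_indices(n, m, rng=random):
    """Same swaps on the virtual array a[i] = i, storing only moved entries."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    moved = {}                    # position -> value, only where a[pos] != pos
    result = []
    for i in range(m):
        k = rng.randrange(i, n)
        a_k = moved.get(k, k)     # current a[k], defaulting to the untouched value
        a_i = moved.get(i, i)     # current a[i]
        result.append(a_k)        # a[i] receives the old a[k]
        moved[k] = a_i            # a[k] receives the old a[i]
        # position i is never read again, so a[i] need not be stored
    return result

print(partial_shuffle_sample("abcdefgh", 3))
print(sparse_sample_indices(10**9, 5))
```

    The sparse version returns indices into the set rather than the objects themselves and keeps at most m dictionary entries, so it works even when n is far too large to materialize. In practice Python's standard library already offers `random.sample` for this task.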
