What is the O value for naive random selection from a finite set?


终归单人心 2021-02-05 17:25

This question on getting random values from a finite set got me thinking...

It's fairly common for people to want to retrieve X unique values from a set of Y values.

8 Answers

闹比i (original poster)

2021-02-05 18:12
Before answering this question in detail, let's define the framework. Suppose you have a collection {a1, a2, ..., an} of n distinct objects and want to pick m distinct objects from this set, such that the probability of a given object aj appearing in the result is equal for all objects.

If you have already picked k items and randomly pick an item from the full set {a1, a2, ..., an}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming the random samples are independent) geometrically distributed with parameter (n-k)/n. Thus the expected number of samples needed to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
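The naive procedure analysed above can be sketched in Python as follows. This is only an illustration under my own assumptions: the name `naive_sample` and the returned draw counter are hypothetical and are not part of the original question.

```python
import random

def naive_sample(items, m, rng=random):
    """Pick m distinct items by drawing from the full set and rejecting repeats.

    Returns (sample, draws), where draws counts every random draw,
    including the rejected ones.
    """
    n = len(items)
    if not 0 <= m <= n:
        raise ValueError("m must be between 0 and len(items)")
    chosen = set()   # indices already picked
    sample = []
    draws = 0
    while len(sample) < m:
        idx = rng.randrange(n)   # uniform draw from the full set
        draws += 1
        if idx not in chosen:    # succeeds with probability (n-k)/n
            chosen.add(idx)
            sample.append(items[idx])
    return sample, draws
```

Averaging `draws` over many runs should come close to the sum of the n/(n-k) terms for k = 0, ..., m-1.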

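In conclusion, if you need m unique, randomly selected objects, the expected total number of samples this naive algorithm takes is

n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-(m-1))

which, as Alderath showed, can be estimated (from above, since each of the m terms is at most the last one) by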

m*n / (n-m+1).
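To see how the exact sum compares with this estimate, both can be evaluated numerically. A small sketch, assuming exact rational arithmetic is acceptable; the function names are made up for this illustration.

```python
from fractions import Fraction

def expected_draws(n, m):
    """Exact expected number of draws: sum of n/(n-k) for k = 0..m-1."""
    return sum(Fraction(n, n - k) for k in range(m))

def estimate(n, m):
    """The estimate m*n/(n-m+1); every term of the sum is at most n/(n-m+1)."""
    return Fraction(m * n, n - m + 1)

if __name__ == "__main__":
    n = 1000
    for m in (10, 500, 900, 1000):
        exact = float(expected_draws(n, m))
        bound = float(estimate(n, m))
        print(f"n={n} m={m}: exact={exact:.1f} estimate={bound:.1f}")
```

For a tiny case such as n = 4, m = 2, the exact value is 1 + 4/3 = 7/3 and the estimate is 8/3.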

You can see a little bit more from this formula:
* The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
* You can expect really long computation times when m is close to n, especially if n is large.

To obtain m unique members from the set, use a variant of Donald Knuth's algorithm for obtaining a random permutation (the Fisher-Yates shuffle). Here, I'll assume that the n objects are stored in an array.
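```
for i = 1..m
  k = randInt(i, n)   // uniform random index in {i, ..., n}
  exchange(i, k)      // swap the array entries at positions i and k
end
// the first m entries of the array are now the selected objects
```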
Here, randInt samples an integer uniformly from {i, i+1, ..., n}, and exchange swaps two members of the array; after the loop, the first m entries of the array are the sample. You only need m swaps, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to store only the entries for which a[i] ≠ i, which gives you O(m) for both time and memory, but with higher constants).
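A possible Python rendering of the partial shuffle, together with the memory-saving variant that stores only displaced entries, might look like this. It is a sketch under my own assumptions: indices are 0-based (unlike the pseudocode), the function names are hypothetical, and the sparse variant returns indices rather than objects.

```python
import random

def partial_shuffle_sample(items, m, rng=random):
    """Pick m distinct items with m swaps on a copy of the input (O(n) memory)."""
    a = list(items)
    n = len(a)
    if not 0 <= m <= n:
        raise ValueError("m must be between 0 and len(items)")
    for i in range(m):
        k = rng.randrange(i, n)      # uniform index in {i, ..., n-1}
        a[i], a[k] = a[k], a[i]      # the "exchange" step
    return a[:m]

def sparse_sample_indices(n, m, rng=random):
    """Pick m distinct indices from range(n) using O(m) time and memory.

    The array a[j] = j is never built; `moved` records only the positions
    that a swap has overwritten.
    """
    if not 0 <= m <= n:
        raise ValueError("m must be between 0 and n")
    moved = {}                       # position -> value currently stored there
    result = []
    for i in range(m):
        k = rng.randrange(i, n)
        value_i = moved.get(i, i)    # current a[i]
        value_k = moved.get(k, k)    # current a[k]
        moved[k] = value_i           # a[k] receives the old a[i]
        result.append(value_k)       # a[i] receives the old a[k]; position i is never read again
    return result
```

For example, `sparse_sample_indices(10**9, 5)` picks five distinct indices without allocating a billion-element array, at the cost of a dictionary lookup per access.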