Suppose I have a list, called elements
, each of which does or does not satisfy some boolean property p
. I want to choose one of the elements that
It works mathematically. Can be proven by induction.
Clearly works for n = 1 element satisfying p.
For n+1 elements, we will choose the element n+1 with probability 1/(n+1), so its probability is OK. But how does that effect the end probability of choosing one of the prior n elements?
Each of the prior n had a chance of being selected with probability 1/n, until we found element n+1. Now, after finding n+1, there is a 1/(n+1) chance that element n+1 is chosen, so there is a n/(n+1) chance that the previously chosen element remains chosen. Which means its final probability of being the chosen after n+1 finds is 1/n * (n/n+1) = 1/n+1 -- which is the probability we want for all n+1 elements for uniform distribution.
If it works for n = 1, and it works for n+1 given n, then it works for all n.
For clarity's sake, I would try:
pickRandElement(elements, p)
OrderedCollection coll = new OrderedCollection
foreach element in elements
if (p(element))
coll.add(element)
if (coll.size == 0) return null
else return coll.get(randInt(coll.size))
To me, that makes it MUCH clearer what you're trying to do and is self-documenting. On top of that, it's simpler and more elegant, and it's now obvious that each will be picked with an even distribution.
In The Practice of Programming, pg. 70, (The Markov Chain Algorithm) there is a similar algorithm for that:
[...]
nmatch = 0;
for ( /* iterate list */ )
if (rand() % ++nmatch == 0) /* prob = 1/nmatch */
w = suf->word;
[...]
"Notice the algorithm for selecting one item at random when we don't know how many items there are. The variable nmatch counts the number of matches as the list is scanned. The expression
rand() % ++nmatch == 0
increments nmatch and is then true with probability 1/nmatch."
decowboy has a nice proof that this works on TopCoder
Yes, I believe so.
The first time you encounter a matching element, you definitely pick it. The next time, you pick the new value with a probability of 1/2, so each of the two elements have an equal chance. The following time, you pick the new value with a probability of 1/3, leaving each of the other elements with a probability of 1/2 * 2/3 = 1/3 as well.
I'm trying to find a Wikipedia article about this strategy, but failing so far...
Note that more generally, you're just picking a random sample out of a sequence of unknown length. Your sequence happens to be generated by taking an initial sequence and filtering it, but the algorithm doesn't require that part at all.
I thought I'd got a LINQ operator in MoreLINQ to do this, but I can't find it in the repository... EDIT: Fortunately, it still exists from this answer:
public static T RandomElement<T>(this IEnumerable<T> source,
Random rng)
{
T current = default(T);
int count = 0;
foreach (T element in source)
{
count++;
if (rng.Next(count) == 0)
{
current = element;
}
}
if (count == 0)
{
throw new InvalidOperationException("Sequence was empty");
}
return current;
}