Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
You could create the target array, then loop through the words, determining the probability that each should be picked, and overwrite the words in the array according to a random number.

For the first word the probability would be f0/m0 (where mk = f0 + ... + fk), i.e. 100%, so all positions in the target array would be filled with w0.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequencies.
Example code in C#:
public class WordFrequency {
    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }
}

WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    // overwrite each slot with probability wf.Frequency / sum
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}
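The same replacement idea can also be sketched in Ruby, for comparison with the alias-method code further down (the method name is my own):

```ruby
# Successive replacement: each word overwrites every slot independently
# with probability freq / (sum of frequencies so far). A slot ends up
# holding word i exactly when word i overwrote it and no later word did,
# which works out to probability f_i / m overall. O(n * p) work.
def weighted_sample_by_replacement(pairs, p)
  result = Array.new(p)
  sum = 0
  pairs.each do |word, freq|
    sum += freq
    p.times do |i|
      result[i] = word if rand(sum) < freq
    end
  end
  result
end

weighted_sample_by_replacement(
  [["Hero", 80], ["Monkey", 4], ["Shoe", 13], ["Highway", 3]], 7)
```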
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space into n partitions, all of the same width r, s.t. nr = m, and s.t. for each word wi:

fi = ∑ over partitions t s.t. wi ∈ t of r × ratio(t, wi)

Since all the partitions are of the same size, selecting a partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word wi s.t. fi < r if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.

Given such a pair wi and wi', we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
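As a worked example of one pairing step (my own numbers, using the frequencies from the C# example padded by a factor of n = 4, as the Ruby code below does, so that r = m = 100; the variable names follow that code):

```ruby
# Padded frequencies: Hero 320, Monkey 16, Shoe 52, Highway 12 (sum 400, r = 100).
m = 100
lessers  = [["Monkey", 16], ["Shoe", 52], ["Highway", 12]]  # at or below average
greaters = [["Hero", 320]]                                   # above average

# One step: Monkey (16 < 100) borrows 84 from Hero to fill its partition.
word, adj_freq = lessers.shift
other_word, other_adj_freq = greaters.shift
other_adj_freq -= (m - adj_freq)          # Hero keeps 320 - 84 = 236
partition = [word, other_word, adj_freq]  # represents Monkey w.p. 16/100, Hero w.p. 84/100
```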
To construct this partition in O(n) time, keep two lists of the words, those with below-average frequency and those with above-average frequency, and repeatedly pair one from each list as described above; the Ruby code below does exactly this.
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = n·fi, which updates m' = mn and sets r' = m when q = n.
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
In Ruby:

def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word, freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [ word, freq * n ]
  end.partition do |word, adj_freq|
    adj_freq <= m
  end
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
      other_word
    end
    [ word, other_word, adj_freq ]
  end
  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
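A quick way to try it out (the definition above is repeated, lightly compacted, so the snippet runs on its own; exact counts will vary from run to run):

```ruby
# Same alias-method sampler as above, repeated for self-containment.
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum, (word, freq)| sum + freq }
  lessers, greaters = input.map { |word, freq| [word, freq * n] }
                           .partition { |_, adj_freq| adj_freq <= m }
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [other_word, other_adj_freq]
      other_word
    end
    [word, other_word, adj_freq]
  end
  (0...p).map do
    word, other_word, adj_freq = partitions[rand(n)]
    rand(m) < adj_freq ? word : other_word
  end
end

words = [["Hero", 80], ["Monkey", 4], ["Shoe", 13], ["Highway", 3]]
counts = weighted_sample_with_replacement(words, 10_000).tally
# counts should land near {"Hero"=>8000, "Shoe"=>1300, "Monkey"=>400, "Highway"=>300}
```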
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms.
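For reference, a minimal sketch of roulette wheel selection in Ruby (the method name is my own; each pick scans the cumulative frequencies in O(n), though a binary search over precomputed cumulative sums would bring a pick down to O(log n)):

```ruby
# Roulette wheel: spin a pointer into [0, m) and return the first
# word whose cumulative frequency passes the pointer.
def roulette_pick(pairs)
  m = pairs.sum { |_, freq| freq }
  spin = rand(m)
  cumulative = 0
  pairs.each do |word, freq|
    cumulative += freq
    return word if spin < cumulative
  end
end

roulette_pick([["Hero", 80], ["Monkey", 4], ["Shoe", 13], ["Highway", 3]])
```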