Efficient algorithm to randomly select items with frequency


Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]

where fi is the frequency of word wi, how can we efficiently pick p words at random, with the probability of selecting each word proportional to its frequency?
3 Answers
  • 2021-01-02 07:19

    You could create the target array, then loop through the words, determining for each one the probability that it should be picked, and overwrite positions in the target array according to a random number.

    For the first word the probability would be f0/m0 (where mk = f0 + ... + fk), i.e. 100%, so all positions in the target array would be filled with w0.

    For the following words the probability falls: word wk overwrites a position with probability fk/mk, so after it has been processed each position holds wi (i ≤ k) with probability fi/mk. When you reach the last word, the target array is therefore filled with words picked at random according to the frequencies.

    Example code in C#:

    public class WordFrequency {
    
        public string Word { get; private set; }
        public int Frequency { get; private set; }
    
        public WordFrequency(string word, int frequency) {
            Word = word;
            Frequency = frequency;
        }
    
    }
    
    WordFrequency[] words = new WordFrequency[] {
        new WordFrequency("Hero", 80),
        new WordFrequency("Monkey", 4),
        new WordFrequency("Shoe", 13),
        new WordFrequency("Highway", 3),
    };
    
    int p = 7;
    string[] result = new string[p];
    int sum = 0;                       // running total of the frequencies seen so far
    Random rnd = new Random();
    foreach (WordFrequency wf in words) {
        sum += wf.Frequency;
        for (int i = 0; i < p; i++) {
            // overwrite this position with probability wf.Frequency / sum, so that
            // after the last word each position holds a word with probability
            // proportional to its frequency
            if (rnd.Next(sum) < wf.Frequency) {
                result[i] = wf.Word;
            }
        }
    }
    
  • 2021-01-02 07:29

    Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

    • There are n partitions, all of the same width r s.t. nr = m.
    • Each partition contains two words in some ratio (which is stored with the partition).
    • For each word wi, fi = Σ over all partitions t with wi ∈ t of r × ratio(t, wi).

    Since all the partitions are of the same size, selecting which partition to use can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which of its two words is used, also in constant work (compare a pseudorandom number with the ratio between the two words). So the p selections can be done in O(p) work, given such a partition.

    The reason that such a partitioning exists is that there exists a word wi s.t. fi < r, if and only if there exists a word wi' s.t. fi' > r, since r is the average of the frequencies.

    Given such a pair wi and wi' we can replace them with a pseudo-word w'i of frequency f'i = r (that represents wi with probability fi/r and wi' with probability 1 - fi/r) and a new word w'i' of adjusted frequency f'i' = fi' - (r - fi) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.

    To construct this partition in O(n) time,

    • go through the list of the words once, constructing two lists:
      • one of words with frequency ≤ r
      • one of words with frequency > r
    • then pull a word from the first list
      • if its frequency = r, then make it into a one element partition
      • otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.

    This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'i = nfi, which updates m' = mn and sets r' = m when q = n.
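
    For a concrete illustration, take the frequencies from the other answer (Hero 80, Monkey 4, Shoe 13, Highway 3), so m = 100 and n = 4: padding gives adjusted frequencies 320, 16, 52, 12 with r' = m = 100, and one possible run of the construction yields

    • [ Monkey 16 | Hero 84 ], leaving Hero with 236
    • [ Shoe 52 | Hero 48 ], leaving Hero with 188
    • [ Highway 12 | Hero 88 ], leaving Hero with 100, which now moves to the "≤ r" list
    • [ Hero 100 ], a one-element partition

    Hero's shares add up to 84 + 48 + 88 + 100 = 320, as required.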

    In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

    In Ruby:

    def weighted_sample_with_replacement(input, p)
      n = input.size
      m = input.inject(0) { |sum,(word,freq)| sum + freq }
    
      # find the words with frequency lesser and greater than average
      lessers, greaters = input.map do |word,freq| 
                            # pad the frequency so we can keep it integral
                            # when subdivided
                            [ word, freq*n ] 
                          end.partition do |word,adj_freq| 
                            adj_freq <= m 
                          end
    
      partitions = Array.new(n) do
        word, adj_freq = lessers.shift
    
        other_word = if adj_freq < m
                       # use part of another word's frequency to pad
                       # out the partition
                       other_word, other_adj_freq = greaters.shift
                       other_adj_freq -= (m - adj_freq)
                       (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                       other_word
                     end
    
        [ word, other_word , adj_freq ]
      end
    
      (0...p).map do 
        # pick a partition at random
        word, other_word, adj_freq = partitions[ rand(n) ]
        # select the first word in the partition with appropriate
        # probability
        if rand(m) < adj_freq
          word
        else
          other_word
        end
      end
    end
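
    For example, calling it with the word/frequency pairs from the other answer as input (shown here only to illustrate the expected input shape):

    input = [ [ "Hero", 80 ], [ "Monkey", 4 ], [ "Shoe", 13 ], [ "Highway", 3 ] ]
    weighted_sample_with_replacement(input, 7)
    # => an array of 7 words, each drawn independently: "Hero" with
    #    probability 0.80, "Shoe" with 0.13, "Monkey" with 0.04 and
    #    "Highway" with 0.03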
    
  • 2021-01-02 07:36

    This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

    Look at Roulette Selection in Genetic Algorithms
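
    The basic idea, as a minimal Ruby sketch (assuming the same array of [word, frequency] pairs as in the other answers; this is only an illustration, not the code from the linked question):

    def roulette_select(input, p)
      # total circumference of the "wheel"
      total = input.inject(0) { |sum, (_, freq)| sum + freq }
      (0...p).map do
        spin = rand(total)            # where the ball lands
        input.each do |word, freq|
          break word if spin < freq   # landed in this word's slice
          spin -= freq                # otherwise move past this slice
        end
      end
    end

    Each draw costs O(n) here (or O(log n) with precomputed cumulative sums and a binary search), which is why the alias-method answer above gets away with less work per selection.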
