What would be the fastest algorithm to randomly select N items from a list based on weights distribution?

Posted by 巧了我就是萌 on 2020-12-25 04:22:46

Question


I have a large list of items, each item has a weight.

I'd like to select N items randomly without replacement, such that items with more weight are more likely to be selected.

I'm looking for the most performant approach. Performance is paramount. Any ideas?


Answer 1:


If you want to sample items without replacement, you have lots of options.

  • Use a weighted-choice-with-replacement algorithm to choose random indices. There are many algorithms like this. One of them is WeightedChoice, described later in this answer, and another is rejection sampling, described as follows. Assume that the highest weight is max and there are n weights. To choose an index in [0, n) using rejection sampling:

    1. Choose a uniform random integer i in [0, n).
    2. With probability weights[i]/max, return i. Otherwise, go to step 1.

    Each time the weighted choice algorithm chooses an index, set the weight for the chosen index to 0 to keep it from being chosen again. Or...

  • Assign each index an exponentially distributed random number (with a rate equal to that index's weight), make a list of pairs assigning each number to an index, then sort that list by those numbers. Then take each item from first to last. This sorting can be done on-line using a priority queue data structure (a technique that leads to weighted reservoir sampling). Notice that the naïve way to generate the random number, -ln(1-RNDU01())/weight, is not robust, however ("Index of Non-Uniform Distributions", under "Exponential distribution").

  • Tim Vieira gives additional options in his blog.

  • A paper by Bram van de Klundert compares various algorithms.
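
As a rough illustration of the exponential-number option above, here is a minimal Python sketch. The function name and parameters are hypothetical, and it relies on the standard library's `random.expovariate`, which uses the simple logarithm-based generator that the answer cautions is not robust, so treat it as a starting point rather than a hardened implementation.

```python
import heapq
import random

def weighted_sample_without_replacement(weights, n, rng=random):
    """Return up to n distinct indices, favouring larger weights.

    Each index with a positive weight gets an exponential random key
    whose rate equals its weight; the n smallest keys form the sample.
    """
    keyed = []
    for i, w in enumerate(weights):
        if w > 0:
            # expovariate(w) draws an exponential variate with rate w.
            keyed.append((rng.expovariate(w), i))
    # nsmallest avoids a full sort when n is much smaller than the list.
    return [i for _, i in heapq.nsmallest(n, keyed)]
```

Indices with zero weight are never returned, and if fewer than n indices have positive weight, the result is shorter than n.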

EDIT (Aug. 19): Note that for these solutions, the weight expresses how likely a given item will appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The methods given above will not necessarily ensure that a given item will appear in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".
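
A small worked example may make this distinction concrete. The weights 1, 1, 2 for items a, b, c and the sample size n = 2 are chosen here purely for illustration. Drawing one item at a time in proportion to the remaining weights, item c ends up in the sample S with probability 5/6, whereas an inclusion probability proportional to its weight would be 2 · 2/4 = 1:

```latex
\begin{align*}
P(c \in S) &= \underbrace{\tfrac{2}{4}}_{c \text{ first}}
            + \underbrace{2 \cdot \tfrac{1}{4} \cdot \tfrac{2}{3}}_{a \text{ or } b \text{ first, then } c}
            = \tfrac{5}{6}, \\
P(a \in S) &= \underbrace{\tfrac{1}{4}}_{a \text{ first}}
            + \underbrace{\tfrac{1}{4} \cdot \tfrac{1}{3}}_{b \text{ first, then } a}
            + \underbrace{\tfrac{2}{4} \cdot \tfrac{1}{2}}_{c \text{ first, then } a}
            = \tfrac{7}{12}.
\end{align*}
```

The three inclusion probabilities sum to 5/6 + 7/12 + 7/12 = 2, the sample size, as expected.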


Previous post:

Assuming you want to choose items at random with replacement, here is pseudocode implementing this kind of choice. Given a list of weights, it returns a random index (starting at 0), chosen with a probability proportional to its weight. See also "Weighted Choice".

METHOD WChoose(weights, value)
    // Choose the index according to the given value
    lastItem = size(weights) - 1
    runningValue = 0
    for i in 0...size(weights) - 1
       if weights[i] > 0
          newValue = runningValue + weights[i]
          lastItem = i
          // NOTE: Includes start, excludes end
          if value < newValue: break
          runningValue = newValue
       end
    end
    // If we didn't break above, fall back to the
    // last index with a positive weight (this might
    // happen because of rounding error)
    return lastItem
END METHOD

METHOD WeightedChoice(weights)
    return WChoose(weights, RNDINTEXC(Sum(weights)))
END METHOD
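
The pseudocode above could be translated to Python roughly as follows. This sketch assumes integer weights with a positive sum, and it assumes that `RNDINTEXC(k)` means a uniform random integer in [0, k), which is mapped here to `random.randrange`.

```python
import random

def w_choose(weights, value):
    """Return the index whose cumulative-weight interval contains value."""
    last_item = len(weights) - 1
    running_value = 0
    for i, w in enumerate(weights):
        if w > 0:
            new_value = running_value + w
            last_item = i
            # The interval includes its start and excludes its end.
            if value < new_value:
                break
            running_value = new_value
    # Without a break, this is the last index with a positive weight.
    return last_item

def weighted_choice(weights, rng=random):
    """Choose an index with probability proportional to its weight."""
    return w_choose(weights, rng.randrange(sum(weights)))
```

For example, with weights `[1, 0, 3]` the values 0 through 3 map to indices 0, 2, 2, 2, so index 1 (weight 0) is never chosen.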

This algorithm is a straightforward way to implement weighted choice, but if it's too slow for you, the following alternatives may be faster:

  • Vose's alias method, a variant of the original Walker's alias method. See "Darts, Dice, and Coins: Sampling from a Discrete Distribution" by Keith Schwarz for more information.
  • The Fast Loaded Dice Roller.
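
For reference, a compact sketch of the alias-method idea in Python might look like the following. The function names are hypothetical, the weights are assumed to be non-negative with a positive sum, and the sketch samples with replacement, so it would still need to be combined with one of the without-replacement strategies discussed earlier.

```python
import random

def build_alias(weights):
    """Build the probability and alias tables in O(n) time."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # average becomes 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        lg = large.pop()
        prob[s] = scaled[s]
        alias[s] = lg
        # The large entry donates mass to fill the small entry's column.
        scaled[lg] = (scaled[lg] + scaled[s]) - 1.0
        (small if scaled[lg] < 1.0 else large).append(lg)
    # Leftovers are 1.0 up to rounding error.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

The tables are built once, after which each draw takes constant time, which is the reason this approach can beat the linear scan when many draws are made from the same weights.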



Answer 2:


Let A be the item array with x items, and let n be the number of items to select. The complexity of each method is given as

< preprocessing_time, querying_time >


If sorting is possible: < O(x lg x), O(n) >

  1. sort A by the weight of the items.
  2. create an array B, for example:

    • B = [ 0, 0, 0, x/2, x/2, x/2, x/2, x/2 ].
    • here a random element of B is more likely to be x/2 than 0.
  3. if you haven't picked n elements yet, choose a random element e from B.

  4. pick a random element from A within the interval e : x-1.

If iterating through the items is possible: < O(x), O(tn) >

  1. iterate through A and find the average weight w of the elements.
  2. define the maximum number of tries t.
  3. try (at most t times) to pick a random element of A whose weight is greater than w.
    • test for some t that gives you good/satisfactory results.

If nothing above is possible: < O(1), O(tn) >

  1. define the maximum number of tries t.
  2. if you haven't picked n elements yet, take t random elements in A.
  3. pick the element with the largest weight.
    • test for some t that gives you good/satisfactory results.
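
The last variant could be sketched in Python along these lines. The function name is hypothetical, and the handling of already-chosen items (skip them and retry) is an assumption, since the steps above do not say what to do with repeats. Note that this is a heuristic: heavier items are favoured, but not in exact proportion to their weights.

```python
import random

def tournament_sample(weights, n, t, rng=random):
    """Pick n distinct indices; each pick is the heaviest of t random tries."""
    if n > len(weights):
        raise ValueError("cannot pick more items than the list holds")
    chosen = set()
    result = []
    while len(result) < n:
        # Take t uniform random candidates, skipping ones already picked.
        candidates = [rng.randrange(len(weights)) for _ in range(t)]
        candidates = [c for c in candidates if c not in chosen]
        if not candidates:
            continue  # every try hit an already-chosen item; retry
        best = max(candidates, key=lambda c: weights[c])
        chosen.add(best)
        result.append(best)
    return result
```

Larger values of t bias the picks more strongly toward heavy items, at the cost of more work per pick.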


Source: https://stackoverflow.com/questions/62455064/what-would-be-the-fastest-algorithm-to-randomly-select-n-items-from-a-list-based