Question
I have a large list of items, and each item has a weight.
I'd like to select N items randomly without replacement, with heavier items being more likely to be selected.
I'm looking for the best-performing approach. Performance is paramount. Any ideas?
Answer 1:
If you want to sample items without replacement, you have lots of options.
- Use a weighted-choice-with-replacement algorithm to choose random indices. There are many algorithms like this. One of them is WeightedChoice, described later in this answer, and another is rejection sampling, described as follows. Assume that the highest weight is max and there are n weights. To choose an index in [0, n) using rejection sampling:
  1. Choose a uniform random integer i in [0, n).
  2. With probability weights[i]/max, return i. Otherwise, go to step 1.
  Each time the weighted choice algorithm chooses an index, set the weight for the chosen index to 0 to keep it from being chosen again. Or...
- Assign each index an exponentially distributed random number (with a rate equal to that index's weight), make a list of pairs assigning each number to an index, then sort that list by those numbers. Then take each item from first to last. This sorting can be done on-line using a priority queue data structure (a technique that leads to weighted reservoir sampling). Notice, however, that the naïve way to generate the random number, -ln(1-RNDU01())/weight, is not robust ("Index of Non-Uniform Distributions", under "Exponential distribution").
Tim Vieira gives additional options in his blog.
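As a rough illustration of the rejection-sampling option, here is a minimal Python sketch. The function names, the `rng` parameter, and the up-front check for enough positive weights are my own assumptions, not part of the answer. Note that `max_weight` is left at its original value after weights are zeroed; the choice stays proportional to the remaining weights, but more tries get rejected as the sample grows.

```python
import random

def rejection_choice(weights, max_weight, rng=random):
    """Return an index chosen with probability proportional to its weight."""
    n = len(weights)
    while True:
        i = rng.randrange(n)  # step 1: uniform integer in [0, n)
        # step 2: accept i with probability weights[i] / max_weight
        if rng.random() * max_weight < weights[i]:
            return i

def sample_without_replacement(weights, k, rng=random):
    """Pick k distinct indices, zeroing each chosen weight (hypothetical helper)."""
    if k == 0:
        return []
    remaining = list(weights)  # copy, so the caller's list is not modified
    if k > sum(1 for w in remaining if w > 0):
        # Assumption: fail early instead of looping forever.
        raise ValueError("not enough items with positive weight")
    max_weight = max(remaining)
    chosen = []
    for _ in range(k):
        i = rejection_choice(remaining, max_weight, rng)
        chosen.append(i)
        remaining[i] = 0  # a zero weight is never accepted again
    return chosen
```

For example, `sample_without_replacement([5, 0, 1, 3, 0, 2], 2)` returns two distinct indices drawn from 0, 2, 3 and 5.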
A paper by Bram van de Klundert compares various algorithms.
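The exponential-key option from the list above could be sketched in Python roughly as follows. The function name and the decision to skip items with zero weight are assumptions of mine. `heapq.nsmallest` keeps only the k smallest keys, which plays the role of the priority queue. As far as I know, CPython's `random.expovariate` uses the straightforward logarithm formula, so this sketch does not address the robustness caveat mentioned above.

```python
import heapq
import random

def sample_by_exponential_keys(weights, k, rng=random):
    """Return up to k distinct indices, heavier items being more likely."""
    # One exponential random number per index, with rate equal to its weight.
    # Assumption: items with weight 0 can never be picked, so they are skipped.
    keyed = (
        (rng.expovariate(w), i)
        for i, w in enumerate(weights)
        if w > 0
    )
    # The k smallest keys, in increasing order, give the sample in draw order.
    return [i for _, i in heapq.nsmallest(k, keyed)]
```

Because every item gets its key in a single pass, the whole sample takes one scan of the list plus the heap maintenance, with no need to modify the weights.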
EDIT (Aug. 19): Note that for these solutions, the weight expresses how likely a given item will appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The methods given above will not necessarily ensure that a given item will appear in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".
Previous post:
Assuming you want to choose items at random with replacement, here is pseudocode implementing this kind of choice. Given a list of weights, it returns a random index (starting at 0), chosen with a probability proportional to its weight. See also "Weighted Choice".
METHOD WChoose(weights, value)
    // Choose the index according to the given value
    lastItem = size(weights) - 1
    runningValue = 0
    for i in 0...size(weights) - 1
       if weights[i] > 0
          newValue = runningValue + weights[i]
          lastItem = i
          // NOTE: Includes start, excludes end
          if value < newValue: break
          runningValue = newValue
       end
    end
    // If we didn't break above, this is a last
    // resort (might happen because rounding
    // error happened somehow)
    return lastItem
END METHOD

METHOD WeightedChoice(weights)
    return WChoose(weights, RNDINTEXC(Sum(weights)))
END METHOD
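For readers who prefer runnable code, the pseudocode above translates to Python more or less line for line. This sketch assumes integer weights with a positive sum, and reads RNDINTEXC(x) as "a uniform random integer in [0, x)"; the snake_case names and the `rng` parameter are mine.

```python
import random

def w_choose(weights, value):
    """Return the index whose cumulative-weight interval contains value."""
    last_item = len(weights) - 1
    running_value = 0
    for i, w in enumerate(weights):
        if w > 0:
            new_value = running_value + w
            last_item = i
            # The interval includes its start and excludes its end.
            if value < new_value:
                break
            running_value = new_value
    # If the loop never broke, fall back to the last positive-weight index.
    return last_item

def weighted_choice(weights, rng=random):
    """Choose an index with probability proportional to its weight."""
    # Assumption: integer weights, so randrange stands in for RNDINTEXC.
    return w_choose(weights, rng.randrange(sum(weights)))
```

With weights `[2, 0, 3]`, the values 0 and 1 map to index 0 and the values 2, 3 and 4 map to index 2, so index 1 is never returned.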
This algorithm is a straightforward way to implement weighted choice, but if it's too slow for you, the following alternatives may be faster:
- Vose's alias method, a variant of the original Walker's alias method. See "Darts, Dice, and Coins: Sampling from a Discrete Distribution" by Keith Schwarz for more information.
- The Fast Loaded Dice Roller.
Answer 2:
Let A be the item array with x items. The complexity of each method is given as < preprocessing_time, querying_time >.
If sorting is possible: < O(x lg x), O(n) >
- sort A by the weight of the items.
- create an array B, for example: B = [ 0, 0, 0, x/2, x/2, x/2, x/2, x/2 ].
  - it's clear to see that a random element of B is more likely to be x/2.
- if you haven't picked n elements yet, choose a random element e from B.
- pick a random element from A within the index range e to x-1.
If iterating through the items is possible: < O(x), O(tn) >
- iterate through A and find the average weight w of the elements.
- define the maximum number of tries t.
- try (at most t times) to pick a random element of A whose weight is bigger than w.
  - test for some t that gives you good/satisfactory results.
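One possible reading of the average-weight method, as a Python sketch. The answer does not say what to do when all t tries fail or how to avoid repeats, so two assumptions are made here: the last tried element is kept when no try beats the average, and repeated picks are simply discarded by collecting indices in a set. All names are hypothetical.

```python
import random

def pick_above_average(weights, avg, t, rng=random):
    """Draw up to t uniform indices; stop at the first weight above avg."""
    i = rng.randrange(len(weights))
    for _ in range(t - 1):
        if weights[i] > avg:
            break
        i = rng.randrange(len(weights))
    # Assumption: if no draw beat the average, keep the last one tried.
    return i

def sample_above_average(weights, n, t, rng=random):
    """Collect n distinct indices, biased toward above-average weights."""
    if n > len(weights):
        raise ValueError("cannot pick more items than the list holds")
    if n == 0:
        return []
    avg = sum(weights) / len(weights)  # O(x) preprocessing pass
    chosen = set()
    while len(chosen) < n:
        # Assumption: a repeated index is dropped and the pick is retried.
        chosen.add(pick_above_average(weights, avg, t, rng))
    return list(chosen)
```

This is a heuristic bias toward heavier items, not a selection with probability proportional to weight, and a larger t makes the bias stronger.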
If nothing above is possible: < O(1), O(tn) >
- define the maximum number of tries t.
- if you haven't picked n elements yet, take t random elements in A.
- pick the element with the biggest weight.
  - test for some t that gives you good/satisfactory results.
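The last method amounts to a "best of t random candidates" pick, which might be sketched in Python as below. The function name is hypothetical, and the handling of repeats (collecting indices in a set and retrying) is my assumption, since the answer does not cover it.

```python
import random

def sample_best_of_t(weights, n, t, rng=random):
    """Pick n distinct indices, each the heaviest of t uniform candidates."""
    if t < 1:
        raise ValueError("t must be at least 1")
    if n > len(weights):
        raise ValueError("cannot pick more items than the list holds")
    chosen = set()
    while len(chosen) < n:
        # Take t random elements and keep the one with the biggest weight.
        candidates = [rng.randrange(len(weights)) for _ in range(t)]
        best = max(candidates, key=lambda i: weights[i])
        # Assumption: a repeated index is dropped and the pick is retried.
        chosen.add(best)
    return list(chosen)
```

With t = 1 this reduces to uniform sampling. Larger values of t push the selection toward the heaviest items, so, as the answer suggests, t has to be tuned by experiment.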
Source: https://stackoverflow.com/questions/62455064/what-would-be-the-fastest-algorithm-to-randomly-select-n-items-from-a-list-based