I have a collection of elements that I want to shuffle randomly, but every element has a different priority or weight. So the element with bigger weight has to have more pro
Based on the @amit suggestion:
def self.random_suffle_with_weight(elements, &proc)
  consecutive_chain = []
  elements.each do |element|
    proc.call(element).times { consecutive_chain << element }
  end
  consecutive_chain.shuffle.uniq
end
I can think of two approaches to solve it, though my gut tells me there should be a modification of Fisher-Yates that achieves it even better:
O(n*W) solution: (simple to program)
First approach: create duplicates according to the weight (same as your approach) and populate a new list. Now run a standard shuffle (Fisher-Yates) on this list. Iterate over the list, discarding duplicates and keeping only the first occurrence of each element. This runs in O(n*W), where n is the number of elements in the list and W is the average weight (a pseudo-polynomial solution).
O(nlogn) solution: (significantly harder to program)
Second approach would be to create a list of sum of weights of the elements:
sum[i] = weight[0] + ... + weight[i]
Now draw a random number r with 0 <= r < sum[n-1] (the total weight), and choose the first element whose sum is strictly greater than r.
This will be the next element; discard it, recreate the list, and repeat.
This runs in O(n^2*logn)
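As a rough sketch of this list-based approach (the method name prefix_sum_shuffle and the rng parameter are my own, not from the question, and positive integer weights are assumed), the prefix sums are rebuilt at every step and the matching element is located with a binary search:

```ruby
# Hypothetical helper: weighted shuffle by repeatedly rebuilding a prefix-sum list.
# Assumes every weight is a positive integer.
def prefix_sum_shuffle(elements, weights, rng: Random.new)
  items = elements.zip(weights)
  result = []
  until items.empty?
    # sums[i] = weight[0] + ... + weight[i] over the remaining items
    total = 0
    sums = items.map { |_, w| total += w }
    r = rng.rand(total)                    # integer in 0...total
    i = sums.bsearch_index { |s| s > r }   # first prefix sum strictly above r
    result << items.delete_at(i).first     # take the element, drop it from the list
  end
  result
end

p prefix_sum_shuffle(%w[a b c], [1, 2, 3])
```

Rebuilding the sums and deleting from the middle of the array are both linear, which is why this version stays quadratic overall even though each lookup is logarithmic.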
It can be further enhanced by creating a binary tree rather than a list, where each node also stores the value of weights of the entire subtree.
Now, after choosing a random number, find the matching node (the first one whose cumulative sum is higher than the randomly selected number), delete the node, and recalculate the weights on the path to the root.
This will take O(n) to create the tree, O(logn) to find the node at each step, and O(logn) to recalculate the sum. Repeat it until the tree is exhausted, and you get an O(nlogn) solution.
The idea of this approach is very similar to Order Statistics Trees, but using the sum of weights rather than the number of descendants. The search and balancing after deletion will be done similarly to an order statistics tree.
Explanation to constructing and using the binary tree.
Assume you have elements=[a,b,c,d,e,f,g,h,i,j,k,l,m]
with weights=[1,2,3,1,2,3,1,2,3,1,2,3,1]
First construct an almost full binary tree, and populate the elements in it. Note that the tree is NOT Binary search tree, just a regular tree, so order of elements does not matter - and we won't need to maintain it later on.
You will get something like the following tree:
Legend: w - weight of that node, sw - sum of weight for the entire subtree.
Next, calculate the sum of weights for each subtree. Start from the leaves, where sw = w. For every other node, sw = w + left->sw + right->sw (the node's own weight plus the sums of its children), filling the tree from the bottom up (post-order traversal).
Building the tree, populating it, and calculating sw for each node is done in O(n).
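A minimal sketch of this bottom-up pass, assuming the almost-full tree is stored in an array in level order (children of index i at 2i+1 and 2i+2, as in a binary heap); the layout and the name subtree_sums are my assumptions, since the original shows the tree only as a picture:

```ruby
# Hypothetical helper: sw[i] = weight of node i plus the weights of all its descendants.
# The tree is stored in level order, so the parent of index i is (i - 1) / 2.
def subtree_sums(weights)
  sw = weights.dup
  # Scanning from the last index down visits every child before its parent.
  (weights.size - 1).downto(1) { |i| sw[(i - 1) / 2] += sw[i] }
  sw
end

weights = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]   # elements a..m
sw = subtree_sums(weights)
puts sw[0]   # total weight at the root: 25
puts sw[1]   # subtree rooted at b: 13
```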
Now, you iteratively need to choose a random number r with 0 <= r < sw of the root (the total weight, in our case 25), and find the matching node for each such number.
Finding the matching node is done recursively:
if `r < root.left.sw`:
    go to the left child, and repeat.
else if `r < root.left.sw + root.w`:
    the node you are seeking is the root; choose it.
else:
    go to `root.right` with `r = r - root.left.sw - root.w`
Example, choosing r=10:
Is r<root.left.sw? Yes. Recursively invoke with r=10, root=B (left child).
Is r<root.left.sw? No. Is r < root.left.sw + root.w? No. Recursively invoke with r=10-6-2=2, and root=E (right child).
Is r<root.left.sw? No. Is r < root.left.sw + root.w? Yes. Choose E as the next node.
This is done in O(h) = O(logn) per iteration.
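The search can be sketched iteratively as follows, again assuming a level-order array layout for the tree (my assumption; find_node and subtree_sums are hypothetical names). For the example weights it reproduces the walk above and lands on e for r=10:

```ruby
# Hypothetical helpers, assuming the tree is stored in level order in an array.
def subtree_sums(weights)
  sw = weights.dup
  (weights.size - 1).downto(1) { |i| sw[(i - 1) / 2] += sw[i] }
  sw
end

# Returns the index of the node matching r, where 0 <= r < sw[0].
def find_node(w, sw, r)
  i = 0
  loop do
    left = 2 * i + 1
    left_sw = left < w.size ? sw[left] : 0   # a missing child contributes 0
    if r < left_sw
      i = left                               # the answer is in the left subtree
    elsif r < left_sw + w[i]
      return i                               # the answer is this node
    else
      r -= left_sw + w[i]                    # skip the left subtree and this node
      i = left + 1
    end
  end
end

names = ("a".."m").to_a
weights = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
sw = subtree_sums(weights)
puts names[find_node(weights, sw, 10)]   # e
```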
Now, you need to delete that node, and reset the weights of the tree.
One approach to deleting that keeps the tree at logarithmic height is similar to a binary heap: replace the chosen node with the bottom rightmost node, remove the bottom rightmost node from its old position, and recalculate the weights along the two branches going from the two involved nodes to the root.
First switch:
Then recalculate:
Note that recalculation is needed only on two paths, each of depth at most O(logn) (the nodes colored orange in the pic), so deletion and recalculation is also O(logn).
Now, you got yourself a new binary tree, with the modified weights, and you are ready to choose the next candidate, until the tree is exhausted.
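Putting the pieces together, here is one way the whole procedure might look. It is only a sketch under my own assumptions: the tree lives in level-order arrays, weights are positive integers, and the sums are adjusted by a delta along each path instead of being recomputed from the children. The name tree_weighted_shuffle and the rng parameter are hypothetical.

```ruby
# Hypothetical sketch of the O(n log n) tree-based weighted shuffle.
def tree_weighted_shuffle(elements, weights, rng: Random.new)
  items = elements.dup
  w = weights.dup
  sw = w.dup
  (w.size - 1).downto(1) { |i| sw[(i - 1) / 2] += sw[i] }   # subtree sums, bottom up

  result = []
  until items.empty?
    # 1. Find the node matching a random r in 0...total.
    r = rng.rand(sw[0])
    i = 0
    loop do
      left = 2 * i + 1
      left_sw = left < w.size ? sw[left] : 0
      if r < left_sw
        i = left
      elsif r < left_sw + w[i]
        break
      else
        r -= left_sw + w[i]
        i = left + 1
      end
    end
    result << items[i]

    # 2. Detach the bottom rightmost node and fix the sums on its path to the root.
    last = w.size - 1
    last_item = items.pop
    last_w = w.pop
    sw.pop
    j = last
    while j > 0
      j = (j - 1) / 2
      sw[j] -= last_w
    end
    next if i == last   # the chosen node was the detached one

    # 3. Move the detached node into slot i and fix the sums on that path.
    delta = last_w - w[i]
    items[i] = last_item
    w[i] = last_w
    j = i
    loop do
      sw[j] += delta
      break if j.zero?
      j = (j - 1) / 2
    end
  end
  result
end

p tree_weighted_shuffle(("a".."m").to_a, [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
```

Each iteration touches at most two root paths plus one search path, so the per-step cost stays logarithmic.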
I would shuffle the array as follows:
Code
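def weighted_shuffle(array)
  # Sort by decreasing weight so the linear scan below usually stops early.
  arr = array.sort_by { |h| -h[:weight] }
  tot_wt = arr.reduce(0) { |t, h| t + h[:weight] }
  ndx_left = arr.each_index.to_a
  arr.size.times.with_object([]) do |_, a|
    cum = 0
    rn = tot_wt > 0 ? rand(tot_wt) : 0
    # rand(tot_wt) lies in 0...tot_wt, so compare strictly; fall back to the
    # first remaining index when every remaining weight is zero.
    ndx = ndx_left.find { |i| rn < (cum += arr[i][:weight]) } || ndx_left.first
    a << arr[ndx]
    tot_wt -= arr[ndx_left.delete(ndx)][:weight]
  end
end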
Examples
elements = [
  { :id => "ID_1", :weight => 100 },
  { :id => "ID_2", :weight => 200 },
  { :id => "ID_3", :weight => 600 }
]

def display(arr, n)
  n.times { p weighted_shuffle(arr).map { |h| h[:id] } }
end

display(elements, 10)
["ID_3", "ID_2", "ID_1"]
["ID_1", "ID_3", "ID_2"]
["ID_1", "ID_3", "ID_2"]
["ID_3", "ID_2", "ID_1"]
["ID_3", "ID_2", "ID_1"]
["ID_2", "ID_3", "ID_1"]
["ID_2", "ID_3", "ID_1"]
["ID_3", "ID_1", "ID_2"]
["ID_3", "ID_1", "ID_2"]
["ID_3", "ID_2", "ID_1"]
n = 10_000
pos = elements.each_index.with_object({}) { |i,pos| pos[i] = Hash.new(0) }
n.times do
  weighted_shuffle(elements).each_with_index { |h, i| pos[i][h[:id]] += 1 }
end
pos.each { |_,h| h.each_key { |k| h[k] = (h[k]/n.to_f).round(3) } }
#=> {0=>{"ID_3"=>0.661, "ID_2"=>0.224, "ID_1"=>0.115},
# 1=>{"ID_2"=>0.472, "ID_3"=>0.278, "ID_1"=>0.251},
# 2=>{"ID_1"=>0.635, "ID_2"=>0.304, "ID_3"=>0.061}}
This says that, of the 10,000 times weighted_shuffle was called, the first element selected was "ID_3" 66.1% of the time, "ID_2" 22.4% of the time and "ID_1" the remaining 11.5% of the time. "ID_2" was selected second 47.2% of the time, and so on.
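As a sanity check on those frequencies, the exact probabilities can be worked out by enumerating every ordering, assuming each step picks a remaining element with probability weight/remaining total. This is just an illustrative sketch; position_probabilities is a name I made up for it.

```ruby
# Hypothetical helper: exact probability of each id landing in each position
# when elements are drawn without replacement, proportionally to weight.
def position_probabilities(elements)
  probs = Array.new(elements.size) { Hash.new(0r) }
  elements.permutation.each do |perm|
    remaining = perm.sum { |h| h[:weight] }
    prob = 1r
    perm.each do |h|
      prob *= Rational(h[:weight], remaining)   # chance of drawing h next
      remaining -= h[:weight]
    end
    perm.each_with_index { |h, i| probs[i][h[:id]] += prob }
  end
  probs
end

elements = [
  { :id => "ID_1", :weight => 100 },
  { :id => "ID_2", :weight => 200 },
  { :id => "ID_3", :weight => 600 }
]
probs = position_probabilities(elements)
puts probs[0]["ID_3"]   # 2/3
puts probs[1]["ID_2"]   # 17/36
```

2/3 is about 0.667 and 17/36 is about 0.472, which agrees with the simulated 66.1% and 47.2%. Enumerating permutations is only practical for a handful of elements.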
Explanation
arr is the array of hashes to be shuffled. The shuffle is performed in arr.size steps. At each step I randomly draw an element of arr, without replacement, using the weights provided. If h[:weight] sums to tot for all elements h of arr that have not been previously selected, the probability of any one of those hashes h being selected is h[:weight]/tot. The selection at each step is done by drawing rand(tot) and scanning for the first element whose cumulative weight covers the draw. This scan is made more efficient by pre-sorting the elements by declining weight, which is done in the first step of the method:
elements.sort_by { |h| -h[:weight] }
#=> [{ :id => "ID_3", :weight => 600 },
# { :id => "ID_2", :weight => 200 },
# { :id => "ID_1", :weight => 100 }]
This is implemented using an array of indices of arr, called ndx_left, over which the iteration is performed. After a hash h at index i is selected, tot is updated by subtracting h[:weight] and i is deleted from ndx_left.
Variant
The following is a variant of the method above:
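def weighted_shuffle_variant(array)
  arr = array.sort_by { |h| -h[:weight] }
  tot_wt = arr.reduce(0) { |t, h| t + h[:weight] }
  n = arr.size
  n.times.with_object([]) do |_, a|
    cum = 0
    rn = tot_wt > 0 ? rand(tot_wt) : 0
    # Strict comparison, since rand(tot_wt) lies in 0...tot_wt; fall back to
    # the first element when every remaining weight is zero.
    h, ndx = arr.each_with_index.find { |e, _| rn < (cum += e[:weight]) } || [arr.first, 0]
    a << h
    tot_wt -= h[:weight]
    # Move the last element into the vacated slot, unless the vacated slot
    # was itself the last one (a bare arr[ndx] = arr.pop would re-append it).
    last = arr.pop
    arr[ndx] = last if ndx < arr.size
  end
end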
Rather than maintaining an array of indices of elements in arr which have not yet been selected, arr is modified in place and reduced in size by one when each element is selected. If the element arr[i] is selected, the last element is copied to offset i and the last element of arr is removed:
arr[i] = arr.pop
Benchmark
The approach of replicating each element h of elements h[:weight] times, shuffling, then uniq-ifying the result is excruciatingly inefficient. If that's not obvious, here's a benchmark. I've compared my weighted_shuffle with @Mori's solution, which is representative of the "replicate, shuffle, delete" approach:
def mori_shuffle(array)
  array.flat_map { |h| [h[:id]] * h[:weight] }.shuffle.uniq
end

require 'benchmark'

def test_em(nelements, ndigits)
  puts "\nelements.size=>#{nelements}, weights have #{ndigits} digits\n\n"
  mx = 10**ndigits
  elements = nelements.times.map { |i| { id: i, weight: rand(mx) } }
  # Labels go on each report; the argument to bm is only the label column width.
  Benchmark.bm(17) do |x|
    x.report("mori_shuffle") { mori_shuffle(elements) }
    x.report("weighted_shuffle") { weighted_shuffle(elements) }
  end
end
elements.size=>3, weights have 1 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000068)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000051)
elements.size=>3, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000035)
weighted_shuffle 0.010000 0.000000 0.010000 ( 0.000026)
elements.size=>3, weights have 3 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000161)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000027)
elements.size=>3, weights have 4 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000854)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000026)
elements.size=>20, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000089)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000090)
elements.size=>20, weights have 3 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000771)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000071)
elements.size=>20, weights have 4 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.005895)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000073)
elements.size=>100, weights have 2 digits
user system total real
mori_shuffle 0.000000 0.000000 0.000000 ( 0.000446)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000683)
elements.size=>100, weights have 3 digits
user system total real
mori_shuffle 0.010000 0.000000 0.010000 ( 0.003765)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000659)
elements.size=>100, weights have 4 digits
user system total real
mori_shuffle 0.030000 0.010000 0.040000 ( 0.034982)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000638)
elements.size=>100, weights have 5 digits
user system total real
mori_shuffle 0.550000 0.040000 0.590000 ( 0.593190)
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000623)
elements.size=>100, weights have 6 digits
user system total real
mori_shuffle 5.560000 0.380000 5.940000 ( 5.944749)
weighted_shuffle 0.010000 0.000000 0.010000 ( 0.000636)
Comparison of weighted_shuffle and weighted_shuffle_variant
Considering that the benchmark engine is all warmed up, I may as well compare the two methods I suggested. The results are similar, with weighted_shuffle having a consistent edge. Here are some typical results:
elements.size=>20, weights have 3 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000062)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000108)
elements.size=>20, weights have 4 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000060)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000089)
elements.size=>100, weights have 2 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000666)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000871)
elements.size=>100, weights have 4 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000625)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000803)
elements.size=>100, weights have 6 digits
user system total real
weighted_shuffle 0.000000 0.000000 0.000000 ( 0.000664)
weighted_shuffle_variant 0.000000 0.000000 0.000000 ( 0.000773)
As compared to weighted_shuffle, weighted_shuffle_variant does not maintain an array of indices of elements of (a copy of) elements that have not yet been selected (a time-saver). Instead, it replaces the selected element in the array with the last element of the array and then pops the last element, causing the size of the array to decrease by one at each step. Unfortunately, that destroys the ordering of elements by decreasing weight. By contrast, weighted_shuffle maintains the optimization of considering elements by decreasing order of weight. On balance, the latter tradeoff appears to be more important than the former.
I have my solution but I think it can be improved:
module Utils
  def self.random_suffle_with_weight(elements, &proc)
    # Create a consecutive chain of elements
    # in which every element is represented
    # as many times as its weight.
    consecutive_chain = []
    elements.each do |element|
      proc.call(element).times { consecutive_chain << element }
    end

    # Choose one element randomly from
    # the consecutive_chain and remove it for the next round
    # until all elements have been chosen.
    shorted_elements = []
    while shorted_elements.length < elements.length
      random_index = Kernel.rand(consecutive_chain.length)
      selected_element = consecutive_chain[random_index]
      shorted_elements << selected_element
      consecutive_chain.delete(selected_element)
    end

    shorted_elements
  end
end
Test:
def test_random_suffle_with_weight
  element_1 = { :id => "ID_1", :weight => 10 }
  element_2 = { :id => "ID_2", :weight => 20 }
  element_3 = { :id => "ID_3", :weight => 60 }
  elements = [element_1, element_2, element_3]

  Kernel.expects(:rand).with(90).returns(11)
  Kernel.expects(:rand).with(70).returns(1)
  Kernel.expects(:rand).with(60).returns(50)

  assert_equal([element_2, element_1, element_3], Utils.random_suffle_with_weight(elements) { |e| e[:weight] })
end
Weighted Random Sampling (2005; Efraimidis, Spirakis) provides a very elegant algorithm for this. The implementation is super simple and runs in O(n log(n)):
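import random
def weighted_shuffle(items, weights):
    # Each item gets the key u ** (1 / w) with u uniform in [0, 1); larger keys come first.
    order = sorted(range(len(items)), key=lambda i: -random.random() ** (1.0 / weights[i]))
    return [items[i] for i in order]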
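Since the rest of the thread is in Ruby, here is how the same key-based idea might be written there. Treat it as a sketch: es_weighted_shuffle and the rng parameter are names I introduced, and weights are assumed to be strictly positive.

```ruby
# Hypothetical Ruby port of the key-based shuffle: every item gets the key
# u ** (1 / weight) with u uniform in [0, 1), and items are ordered by
# decreasing key. Assumes all weights are strictly positive.
def es_weighted_shuffle(items, weights, rng: Random.new)
  keyed = items.each_with_index.map do |item, i|
    u = rng.rand                        # uniform float in [0, 1)
    [u**(1.0 / weights[i]), item]
  end
  keyed.sort_by { |key, _| -key }.map { |_, item| item }
end

p es_weighted_shuffle(%w[ID_1 ID_2 ID_3], [100, 200, 600])
```

Unlike the replicate-and-shuffle approach, this does not require integer weights, and the cost is dominated by a single sort.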
elements.flat_map { |h| [h[:id]] * h[:weight] }.shuffle.uniq