c++ discrete distribution sampling with frequently changing probabilities


Question


Problem: I need to sample from a discrete distribution constructed from certain weights, e.g. {w1, w2, w3, ...}, and thus a probability distribution {p1, p2, p3, ...}, where pi = wi / (w1 + w2 + ...).

Some of the wi change very frequently, but only a very small proportion of all of them. The distribution nevertheless has to be renormalised every time this happens, so I believe the alias method does not work efficiently here, because one would need to rebuild the whole distribution from scratch on every change.

The method I am currently considering is a binary tree (heap method), where all wi are stored in the lowest level, the sums of each pair in the level above, and so on. The sum of all of them sits at the top and is also the normalisation constant. To update the tree after a change in some wi one needs O(log n) operations, and drawing a sample from the distribution takes the same number.
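For reference, a minimal sketch of this sum-tree idea (my own illustration, not the original heap-in-a-vector code; the class name and interface are made up, and it assumes at least one weight):

#include <cstddef>
#include <random>
#include <vector>

// Weights sit in the leaves of a binary tree stored in a flat vector;
// each internal node holds the sum of its two children, so the root is
// the normalisation constant.
class WeightTree {
public:
    explicit WeightTree(const std::vector<double>& w)
        : n_(w.size()), tree_(2 * w.size(), 0.0) {
        for (std::size_t i = 0; i < n_; ++i) tree_[n_ + i] = w[i];
        for (std::size_t i = n_ - 1; i >= 1; --i)
            tree_[i] = tree_[2 * i] + tree_[2 * i + 1];
    }

    // O(log n): change weight i and repair the sums on the path to the root.
    void update(std::size_t i, double w) {
        i += n_;
        tree_[i] = w;
        for (i /= 2; i >= 1; i /= 2)
            tree_[i] = tree_[2 * i] + tree_[2 * i + 1];
    }

    // O(log n): walk down from the root, branching on the partial sums.
    template <class RNG>
    std::size_t sample(RNG& rng) const {
        std::uniform_real_distribution<double> dist(0.0, tree_[1]);
        double u = dist(rng);
        std::size_t i = 1;
        while (i < n_) {
            if (u < tree_[2 * i]) i = 2 * i;
            else { u -= tree_[2 * i]; i = 2 * i + 1; }
        }
        return i - n_;  // index of the sampled weight
    }

private:
    std::size_t n_;
    std::vector<double> tree_;
};

With this layout, both changing one wi and drawing one sample cost O(log n), matching the heap approach described above.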

Question:

Q1. Do you have a better idea on how to achieve it faster?

Q2. The most important part: I am looking for a library which has already done this.

Explanation: I did this myself several years ago by building the heap structure in a vector, but since then I have learned many things, including discovering libraries ( :) ) and containers such as map... Now I need to rewrite that code with more functionality, and I want to get it right this time:

So Q2.1: is there a nice way to make a C++ map ordered and searched not by index, but by the cumulative sum of its elements (this is how we sample, right?..)? (That is my current idea of how I would like to do it, but it does not have to be this way...)

Q2.2: Maybe there is an even nicer way to do the same thing? I would think this problem is so common that I am very surprised I could not find some sort of library which would do it for me...

Thank you very much, and I am very sorry if this has been asked in some other form; please direct me towards it, but I have spent a good while looking...

-z

Edit: There is a possibility that I might also need to remove or add elements, but I think I could avoid that if it makes a huge difference, leaving only changes to the values of the weights.

Edit 2: the weights are reals in general; I would have to think about whether I could make them integers...


Answer 1:


I would actually use a hash set of strings (I don't remember the C++ container for it; you might need to implement your own). For each i, put wi elements into it, with the values "wi_1", "wi_2", ... all the way through "wi_[wi]" (that is, wi elements whose names start with "wi_").

When you need to sample, pick an element at random using a uniform distribution. If you picked "w5_*", say, you have sampled element 5. Because of the number of copies of each element in the hash, this gives you exactly the distribution you were looking for.

Now, when wi changes from A to B, just add B-A elements to the hash (if B>A), or remove the last A-B elements of wi (if A>B).

Adding new elements and removing old elements is trivial in this case.

Obviously the tricky part is 'pick an element at random'. If your hash is a closed (open-addressing) hash, pick an array cell at random; if it is empty, just pick another one at random. If you keep the table 3 or 4 times larger than the total sum of the weights, the complexity is pretty good: O(1) to retrieve a random sample, and O(|A-B|) to modify a weight.
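A rough sketch of this "one copy per unit of weight" idea, assuming integer weights (the class and member names are made up for illustration; instead of a hash of strings it keeps a flat pool of index copies plus bookkeeping so that adding or removing a single copy stays O(1)):

#include <cstddef>
#include <random>
#include <vector>

class CopyPoolSampler {
public:
    explicit CopyPoolSampler(const std::vector<std::size_t>& w) : slots_(w.size()) {
        for (std::size_t i = 0; i < w.size(); ++i)
            for (std::size_t c = 0; c < w[i]; ++c) push_copy(i);
    }

    // O(1): a uniform pick from the pool is a weighted pick over the indices.
    // Assumes the pool is non-empty.
    template <class RNG>
    std::size_t sample(RNG& rng) const {
        std::uniform_int_distribution<std::size_t> d(0, pool_.size() - 1);
        return pool_[d(rng)];
    }

    // O(|new_w - old_w|): add or remove copies of index i.
    void set_weight(std::size_t i, std::size_t new_w) {
        while (slots_[i].size() < new_w) push_copy(i);
        while (slots_[i].size() > new_w) pop_copy(i);
    }

private:
    void push_copy(std::size_t i) {
        slots_[i].push_back(pool_.size());
        pool_.push_back(i);
        pos_.push_back(slots_[i].size() - 1);
    }
    void pop_copy(std::size_t i) {
        std::size_t slot = slots_[i].back();   // pool slot to vacate
        slots_[i].pop_back();
        std::size_t last = pool_.size() - 1;
        if (slot != last) {                    // move the last copy into the hole
            std::size_t j = pool_[last];
            pool_[slot] = j;
            pos_[slot] = pos_[last];
            slots_[j][pos_[slot]] = slot;
        }
        pool_.pop_back();
        pos_.pop_back();
    }

    std::vector<std::size_t> pool_;                // pool_[k] = index owning copy k
    std::vector<std::size_t> pos_;                 // pos_[k] = where copy k sits in slots_[pool_[k]]
    std::vector<std::vector<std::size_t>> slots_;  // slots_[i] = pool slots holding copies of i
};

Uniform sampling from the pool then reproduces the weighted distribution, and changing a weight from A to B via set_weight costs O(|A-B|), as described above.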

Another option, since only a small part of your weights change, is to split the weights into two groups: the fixed part and the changing part. Then you only need to handle changes in the changing part, together with how the total weight is split between the changing and the fixed part. For the fixed part your hash becomes a simple array of numbers: 1 appears w1 times, 2 appears w2 times, etc., and picking a random fixed element is just picking a random number.
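A small sketch of how the two parts could be combined at sampling time (entirely illustrative; FixedSampler and ChangingSampler are hypothetical components, e.g. the copy pool above for the fixed part and a small sum tree for the changing part, and each is assumed to return an index in the overall index space):

#include <cstddef>
#include <random>

// First choose which part to sample from, in proportion to the two total
// weights, then sample inside that part.
template <class FixedSampler, class ChangingSampler, class RNG>
std::size_t sample_split(const FixedSampler& fixed, double fixed_total,
                         const ChangingSampler& changing, double changing_total,
                         RNG& rng) {
    std::uniform_real_distribution<double> pick(0.0, fixed_total + changing_total);
    if (pick(rng) < fixed_total)
        return fixed.sample(rng);   // draw from the fixed weights
    return changing.sample(rng);    // draw from the changing weights
}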




Answer 2:


Updating your normalisation factor when you change a value is trivial; this might suggest an algorithm:

w_sum = w_sum_old - w_i_old + w_i_new;

If you leave p_i as a computed property p_i = w_i / w_sum, you avoid recalculating the entire p_i array, at the cost of computing p_i every time it is needed. You would, however, be able to update many statistical properties without recalculating the entire sum:

expected_something = (something_1 * w_1 + something_2 * w_2 + ...) / w_sum;

With a bit of algebra you can update expected_something by subtracting the contribution with the old weight and adding the contribution with the new weight, multiplying and dividing by the normalisation factors as required.
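A small numeric sketch of that incremental update (my own illustration; the struct and member names are made up):

#include <cstddef>
#include <vector>

// Keep the unnormalised weighted sum S = sum_i something_i * w_i together
// with w_sum; both are patched in O(1) when one weight changes.
struct RunningExpectation {
    double S = 0.0;      // sum_i something_i * w_i
    double w_sum = 0.0;  // sum_i w_i

    RunningExpectation(const std::vector<double>& something,
                       const std::vector<double>& w) {
        for (std::size_t i = 0; i < w.size(); ++i) {
            S += something[i] * w[i];
            w_sum += w[i];
        }
    }

    // Weight i changes from w_old to w_new; something_i is its value.
    void update(double something_i, double w_old, double w_new) {
        S += something_i * (w_new - w_old);  // patch the weighted sum
        w_sum += w_new - w_old;              // patch the normalisation factor
    }

    double expected() const { return S / w_sum; }
};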

If you keep track during sampling of which outcomes are part of the sample, it would be possible to propagate the probability updates to the generated sample. Would this make it possible for you to update, rather than recalculate, values related to the sample? I think a bitmap could provide an efficient way to store an index of which outcomes were used to build the sample.

One way of storing the probabilities together with the sums is to start with all N probabilities. In the next N/2 positions you store the sums of the pairs, after that the N/4 sums of those pairs, and so on. Where the sums are located can obviously be calculated in O(1) time. This data structure is a sort of heap, but upside down.
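A sketch of that "upside-down heap" layout (my own illustration; the struct name is made up, and it assumes the number of weights is a non-zero power of two so the levels pack exactly):

#include <cstddef>
#include <vector>

// Positions 0..N-1 hold the weights, the next N/2 positions the pair sums,
// then N/4, and so on; the last position holds the total.
struct UpsideDownHeap {
    std::size_t n;           // number of weights
    std::vector<double> a;   // size 2n - 1

    explicit UpsideDownHeap(const std::vector<double>& w)
        : n(w.size()), a(2 * w.size() - 1, 0.0) {
        for (std::size_t i = 0; i < n; ++i) a[i] = w[i];
        std::size_t src = 0, dst = n;
        for (std::size_t len = n; len > 1; len /= 2) {   // build each level of sums
            for (std::size_t k = 0; k < len / 2; ++k)
                a[dst + k] = a[src + 2 * k] + a[src + 2 * k + 1];
            src = dst;
            dst += len / 2;
        }
    }

    // O(log n): change weight i and recompute the sums on the path to the total.
    void update(std::size_t i, double w) {
        a[i] = w;
        std::size_t src = 0, dst = n, k = i;
        for (std::size_t len = n; len > 1; len /= 2) {
            k /= 2;                                      // position within the next level
            a[dst + k] = a[src + 2 * k] + a[src + 2 * k + 1];
            src = dst;
            dst += len / 2;
        }
    }
};

Sampling would then descend from the total at the end of the array back towards the weights, mirroring the heap method from the question.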



Source: https://stackoverflow.com/questions/25189406/c-discrete-distribution-sampling-with-frequently-changing-probabilities
