Question
If I evaluate something like:
numpy.random.choice(2, size=100000, p=[0.01, 0.99])
using one uniformly-distributed random float, say r, and deciding whether r < 0.01, it will presumably waste many of the random bits (entropy) generated. I've heard (second-hand) that generating pseudo-random numbers is computationally expensive, so I assumed that numpy would not be doing that, and would instead use a scheme like arithmetic coding in this case.
However, at first glance it appears that choice does indeed generate a float for every sample it is asked for. Further, a quick timeit experiment shows that generating n uniform floats is actually quicker than drawing n samples from p=[0.01, 0.99]:
>>> timeit.timeit(lambda : numpy.random.choice(2, size=100000, p=[0.01, 0.99]), number=1000)
1.74494537999999
>>> timeit.timeit(lambda : numpy.random.random(size=100000), number=1000)
0.8165735180009506
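For the two-outcome case above, the one-float-per-sample scheme can be written out directly as a threshold test; this is only a distributional equivalent of the choice call, not a claim about its internals:

import numpy

# Outcome 0 with probability 0.01, outcome 1 otherwise -- the same
# distribution as numpy.random.choice(2, size=100000, p=[0.01, 0.99]).
r = numpy.random.random(size=100000)   # one uniform float per sample
samples = numpy.where(r < 0.01, 0, 1)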
Does choice really generate a float for each sample, as it would appear? Would it not significantly improve performance to use some compression algorithm in some cases (particularly if size is large and p is distributed unevenly)? If not, why not?
Answer 1:
Since NumPy 1.17, the reason is largely backward compatibility. See also this question and this question.
As of NumPy 1.17, numpy.random.* functions, including numpy.random.choice, are legacy functions and "SHALL remain the same as they currently are", according to NumPy's new RNG policy, which also introduced a new random generation system for NumPy. The reasons for making them legacy functions include the recommendation to avoid global state. Even so, NumPy did not deprecate any numpy.random.* functions in version 1.17, although a future version might.
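For reference, a minimal sketch of the newer Generator-based API that the 1.17 policy introduced (the variable name rng is mine; using it does not by itself change how the weights are consumed):

import numpy

# New-style generator introduced in NumPy 1.17; the legacy call in the
# question is kept unchanged for backward compatibility.
rng = numpy.random.default_rng()
samples = rng.choice(2, size=100000, p=[0.01, 0.99])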
Recall that in your examples, numpy.random.choice takes an array of floats as weights. An array of integer weights would lead to more exact random number generation. And although any float could be converted to a rational number (leading to rational-valued weights and thus integer weights), the legacy NumPy version appears not to do this. These and other implementation decisions in numpy.random.choice can't be changed without breaking backward compatibility.
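To illustrate what integer weights buy, here is a minimal sketch (my own, not NumPy code) that picks an index using only an exact uniform integer from randrange, so no floating-point rounding is involved; the weights [1, 99] correspond exactly to p=[0.01, 0.99]:

import itertools
import random

def choice_int_weights(weights):
    # Pick index i with probability weights[i] / sum(weights), exactly.
    r = random.randrange(sum(weights))   # exact uniform integer in [0, total)
    for i, cumulative in enumerate(itertools.accumulate(weights)):
        if r < cumulative:
            return i

samples = [choice_int_weights([1, 99]) for _ in range(10)]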
By the way, arithmetic coding is not the only algorithm that seeks to avoid wasting bits. Perhaps the canonical algorithm for sampling from a discrete distribution is the Knuth and Yao algorithm (1976), which exactly chooses a random integer based on the binary expansions of the probabilities involved, and treats the problem as a random walk on a binary tree. (On average, this algorithm uses at most 2 bits more than the theoretical lower bound, the entropy of the distribution.) Any other integer-generating algorithm can ultimately be described in the same way, namely as a random walk on a binary tree. For example, the Fast Loaded Dice Roller is a recent algorithm with a guaranteed bound on the average number of bits it uses (in this case, no more than 6 bits above the theoretical lower bound). The Han and Hoshi algorithm (from 1997) is another of this kind, but uses cumulative probabilities. See also my section, "Weighted Choice With Replacement".
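As a rough illustration of the Knuth and Yao random walk, here is a sketch of my own, under the assumption that the probabilities have finite binary expansions summing exactly to 1 (function and variable names are mine):

import random

def knuth_yao(bit_table):
    # bit_table[i][k] is the k-th binary digit (after the point) of p_i.
    # Walk down the tree, consuming exactly one random bit per level.
    d = 0                                   # position among non-leaf nodes at this level
    for k in range(len(bit_table[0])):
        d = 2 * d + random.getrandbits(1)   # take one branch, using one random bit
        for i, row in enumerate(bit_table):
            d -= row[k]                     # outcome i owns row[k] leaves at this level
            if d < 0:
                return i
    raise ValueError("digits must represent probabilities summing to 1")

# p = [3/8, 5/8], i.e. binary 0.011 and 0.101
samples = [knuth_yao([[0, 1, 1], [1, 0, 1]]) for _ in range(10)]

With dyadic probabilities like these the walk always terminates; for probabilities such as 0.01 and 0.99, whose binary expansions do not terminate, the digit stream would have to be extended on demand.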
Source: https://stackoverflow.com/questions/63180186/why-does-numpy-random-choice-not-use-arithmetic-coding