Question
If I evaluate something like:
numpy.random.choice(2, size=100000, p=[0.01, 0.99])
using one uniformly-distributed random float, say r, and deciding whether r < 0.01, it will presumably waste many of the random bits (entropy) generated. I've heard (second-hand) that generating pseudo-random numbers is computationally expensive, so I assumed that numpy would not be doing that, and would instead use a scheme like arithmetic coding in this case.
However, at first glance it appears that choice does indeed generate a float for every sample it is asked for. Further, a quick timeit experiment shows that generating n uniform floats is actually quicker than drawing n samples from p=[0.01, 0.99]:
>>> timeit.timeit(lambda : numpy.random.choice(2, size=100000, p=[0.01, 0.99]), number=1000)
1.74494537999999
>>> timeit.timeit(lambda : numpy.random.random(size=100000), number=1000)
0.8165735180009506
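For the two-outcome case above, the one-float-per-sample scheme can be written out directly as a threshold test; this is only a distributional equivalent of the choice call, not a claim about its internals:

import numpy

# Outcome 0 with probability 0.01, outcome 1 otherwise -- the same
# distribution as numpy.random.choice(2, size=100000, p=[0.01, 0.99]).
r = numpy.random.random(size=100000)   # one uniform float per sample
samples = numpy.where(r < 0.01, 0, 1)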
Does choice really generate a float for each sample, as it would appear? Would it not significantly improve performance to use some compression algorithm in some cases (particularly if size is large and p is distributed unevenly)? If not, why not?
Answer 1:
Since NumPy 1.17, the reason is largely backward compatibility. See also this question and this question.
As of NumPy 1.17, numpy.random.* functions, including numpy.random.choice, are legacy functions and "SHALL remain the same as they currently are", according to NumPy's new RNG policy, which also introduced a new random generation system for NumPy. The reasons for making them legacy functions include the recommendation to avoid global state. Even so, NumPy did not deprecate any numpy.random.* functions in version 1.17, although a future version might.
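For reference, a minimal sketch of the newer Generator-based API that the 1.17 policy introduced (the variable name rng is mine; using it does not by itself change how the weights are consumed):

import numpy

# New-style generator introduced in NumPy 1.17; the legacy call in the
# question is kept unchanged for backward compatibility.
rng = numpy.random.default_rng()
samples = rng.choice(2, size=100000, p=[0.01, 0.99])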
Recall that in your examples, numpy.random.choice takes an array of floats as weights. An array of integer weights would lead to more exact random number generation. And although any float could be converted to a rational number (leading to rational-valued weights and thus integer weights), the legacy NumPy version appears not to do this. These and other implementation decisions in numpy.random.choice can't be changed without breaking backward compatibility.
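To illustrate what integer weights buy, here is a minimal sketch (my own, not NumPy code) that picks an index using only an exact uniform integer from randrange, so no floating-point rounding is involved; the weights [1, 99] correspond exactly to p=[0.01, 0.99]:

import itertools
import random

def choice_int_weights(weights):
    # Pick index i with probability weights[i] / sum(weights), exactly.
    r = random.randrange(sum(weights))   # exact uniform integer in [0, total)
    for i, cumulative in enumerate(itertools.accumulate(weights)):
        if r < cumulative:
            return i

samples = [choice_int_weights([1, 99]) for _ in range(10)]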
By the way, arithmetic coding is not the only algorithm that seeks to avoid wasting bits. Perhaps the canonical algorithm for sampling from a discrete distribution is the Knuth and Yao algorithm (1976), which exactly chooses a random integer based on the binary expansions of the probabilities involved, and treats the problem as a random walk on a binary tree. (On average, this algorithm uses at most 2 bits more than the theoretical lower bound, the entropy of the distribution.) Any other integer-generating algorithm can ultimately be described in the same way, namely as a random walk on a binary tree. For example, the Fast Loaded Dice Roller is a recent algorithm with a guaranteed bound on the average number of bits it uses (in this case, no more than 6 bits above the theoretical lower bound). The Han and Hoshi algorithm (from 1997) is another of this kind, but uses cumulative probabilities. See also my section, "Weighted Choice With Replacement".
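As a rough illustration of the Knuth and Yao random walk, here is a sketch of my own, under the assumption that the probabilities have finite binary expansions summing exactly to 1 (function and variable names are mine):

import random

def knuth_yao(bit_table):
    # bit_table[i][k] is the k-th binary digit (after the point) of p_i.
    # Walk down the tree, consuming exactly one random bit per level.
    d = 0                                   # position among non-leaf nodes at this level
    for k in range(len(bit_table[0])):
        d = 2 * d + random.getrandbits(1)   # take one branch, using one random bit
        for i, row in enumerate(bit_table):
            d -= row[k]                     # outcome i owns row[k] leaves at this level
            if d < 0:
                return i
    raise ValueError("digits must represent probabilities summing to 1")

# p = [3/8, 5/8], i.e. binary 0.011 and 0.101
samples = [knuth_yao([[0, 1, 1], [1, 0, 1]]) for _ in range(10)]

With dyadic probabilities like these the walk always terminates; for probabilities such as 0.01 and 0.99, whose binary expansions do not terminate, the digit stream would have to be extended on demand.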
Source: https://stackoverflow.com/questions/63180186/why-does-numpy-random-choice-not-use-arithmetic-coding