I am trying to create a huge boolean matrix which is randomly filled with True and False with a given probability p. So I tried to split it up into the generation of the single rows by doing this:
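A minimal, self-contained sketch of such a row-by-row approach (the names N, p, and mask, and the small value of N, are assumptions for illustration):

```python
import numpy as np

N = 1000   # the question uses a much larger N, e.g. 30000
p = 0.1    # probability of False, matching p=[p, 1-p] below

# preallocate the boolean result, then fill it one row at a time
mask = np.empty((N, N), dtype=bool)
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1 - p])
```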
The way that np.random.choice works is by first generating a float64 in [0, 1) for every cell of your data, and then converting that into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
Since your data is boolean, you can get a factor-of-two speedup with np.random.rand(N, N) > p, which naturally you could also use inside your looping solution.
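For example, assuming p is the probability of a False entry (matching p=[p, 1-p] in the question):

```python
import numpy as np

N, p = 1000, 0.1

# one uniform float64 per cell, compared against p;
# the result is a boolean array that is True with probability 1 - p
mask = np.random.rand(N, N) > p

print(mask.dtype)   # bool
```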
It seems like np.random.choice could do with some buffering here - you might want to file an issue against numpy.
Another option would be to try to generate float32s instead of float64s. I'm not sure if numpy can do that right now, but you could request the feature.
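As of NumPy 1.17, the Generator API does support this; a sketch (the variable names are assumptions):

```python
import numpy as np

N, p = 1000, 0.1
rng = np.random.default_rng()

# Generator.random accepts a dtype argument, so the intermediate
# uniform samples take 4 bytes per cell instead of 8
mask = rng.random((N, N), dtype=np.float32) > p
```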
Another possibility would be to generate the matrix in batches (i.e. compute many sub-arrays and stack them together at the very end). But avoid updating one big array (mask) in a for loop as the OP is doing: that forces the whole array to be loaded into main memory on every indexing update.
Instead, for example, to get 30000x30000, keep 90,000 separate 100x100 arrays, update each 100x100 array accordingly in a for loop, and finally stack these 90,000 arrays together into a giant array. This would definitely need no more than 4GB of RAM and would be very fast as well.
Minimal Example:
In [9]: a
Out[9]:
array([[0, 1],
[2, 3]])
In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])
In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
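Applied to the random mask itself, each block is generated independently rather than tiled from one array; a sketch using the rand-comparison trick from the other answer (N, block, and p are assumptions, with N much smaller than in the question):

```python
import numpy as np

N, block, p = 1000, 100, 0.1
nb = N // block   # blocks per side

# generate each small block independently, assemble each row-band
# with hstack, then stack the bands into the final matrix
rows = []
for _ in range(nb):
    band = [np.random.rand(block, block) > p for _ in range(nb)]
    rows.append(np.hstack(band))
mask = np.vstack(rows)

print(mask.shape)   # (1000, 1000)
```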
The problem is your RAM: the values are stored in memory as the matrix is being created. I just created this matrix using this command:
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
I used an AWS i3 instance with 64GB of RAM and 8 cores. While creating this matrix, htop shows that it takes up ~20GB of RAM. Here is a benchmark in case you care:
time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
Wall time: 21.7 s
def mask_method(N, p):
    mask = np.empty((N, N), dtype=bool)
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)
time mask_method(N,p)
CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
Wall time: 22.5 s
Note that the mask method only takes up ~9GB of RAM at its peak.
Edit: The first method flushes the RAM after the process is done, whereas the function method retains all of it.