Question
I'm looking for a more compact way to store booleans.
NumPy internally needs 8 bits to store one boolean, but np.packbits can pack them, which is pretty cool.
The problem is that to pack a 32e6-byte boolean array into a 4e6-byte array, we first need to spend 256e6 bytes converting the boolean array to an int array!
In [1]: db_bool = np.array(np.random.randint(2, size=(int(2e6), 16)), dtype=bool)
In [2]: db_int = np.asarray(db_bool, dtype=int)
In [3]: db_packed = np.packbits(db_int, axis=0)
In [4]: db_bool.nbytes, db_int.nbytes, db_packed.nbytes
Out[4]: (32000000, 256000000, 4000000)
There is a year-old open issue about this in the NumPy tracker (cf. https://github.com/numpy/numpy/issues/5377).
Does anyone have a solution or a better workaround?
Here is the traceback when trying to pack the boolean array directly:
In [28]: db_pb = np.packbits(db_bool)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-3715e167166b> in <module>()
----> 1 db_pb = np.packbits(db_bool)
TypeError: Expected an input array of integer data type
PS: I will give bitarray a try, but I would have preferred to do this in pure NumPy.
Answer 1:
There's no need to convert your boolean array to the native int dtype (which is 64-bit on most x86_64 platforms). You can avoid copying your boolean array by viewing it as np.uint8, which also uses a single byte per element:
packed = np.packbits(db_bool.view(np.uint8))
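# The builtin bool is used as the dtype because the np.bool alias is not available in every NumPy release
unpacked = np.unpackbits(packed)[:db_bool.size].reshape(db_bool.shape).view(bool)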
print(np.all(db_bool == unpacked))
# True
Also, np.packbits should now work directly on boolean arrays as of this commit from over a year ago (NumPy v1.10.0 and newer).
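As a minimal sketch of the view-based round trip described above, the two helpers below wrap packing and unpacking while keeping track of the original shape. The names pack_bool and unpack_bool are hypothetical (they are not part of NumPy), and the array size is just an example chosen so that the bit count is not a multiple of 8:

```python
import numpy as np

def pack_bool(arr):
    """Pack a boolean array into bits; returns (packed_bytes, original_shape)."""
    # No copy is made here if arr is already a C-contiguous boolean array
    arr = np.ascontiguousarray(arr, dtype=bool)
    # view() reinterprets the same buffer as uint8, so no int copy is needed
    return np.packbits(arr.view(np.uint8)), arr.shape

def unpack_bool(packed, shape):
    """Inverse of pack_bool: drop the padding bits and restore the shape."""
    n = int(np.prod(shape))
    # np.packbits pads the last byte with zeros, so trim back to n bits
    return np.unpackbits(packed)[:n].reshape(shape).view(bool)

db_bool = np.random.randint(2, size=(1001, 13)).astype(bool)
packed, shape = pack_bool(db_bool)
restored = unpack_bool(packed, shape)

print(db_bool.nbytes, packed.nbytes)
print(np.array_equal(db_bool, restored))
```

Because the array is packed after flattening, the packed buffer takes one eighth of the boolean array's size, rounded up to a whole byte. Packing along a specific axis, as in the question, would need the axis argument instead and a different trimming step.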
Answer 2:
Just yesterday, I answered a newcomer's question on how to deal with bits in Python as compared to C++. After warning that there would be no speed gains, I sketched a naive "bitarray" that internally uses Python's bytearray objects.
This is in no way fast, but if you are no longer operating on your array's bits and just want the output, it may be good enough, since you have full control over the conversion in Python code. Otherwise, you can try adding static type hints and running the same code under Cython, in which case you will probably want to use a NumPy array with dtype=int8 instead of a bytearray:
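class BitArray(object):
    """Naive fixed-length bit array backed by a bytearray (MSB first in each byte)."""

    def __init__(self, length):
        # One byte per 8 bits, rounded up
        self.values = bytearray(b"\x00" * (length // 8 + (1 if length % 8 else 0)))
        self.length = length

    def __setitem__(self, index, value):
        value = int(bool(value)) << (7 - index % 8)
        # Mask with every bit set except the target bit
        mask = 0xff ^ (1 << (7 - index % 8))
        self.values[index // 8] &= mask
        self.values[index // 8] |= value

    def __getitem__(self, index):
        # Raising IndexError lets iteration stop at the logical length
        if not 0 <= index < self.length:
            raise IndexError(index)
        mask = 1 << (7 - index % 8)
        return bool(self.values[index // 8] & mask)

    def __len__(self):
        return self.length

    def __repr__(self):
        return "<{}>".format(", ".join("{:d}".format(value) for value in self))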
This code was originally posted here: Is there a builtin bitset in Python that's similar to the std::bitset from C++?
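The same get/set logic can also be applied to a NumPy byte array, as the answer hints. The following is only a sketch under some assumptions: get_bit and set_bit are hypothetical helper names, the buffer is a flat uint8 array (uint8 rather than int8, because that is what np.packbits returns), and bits are stored most significant bit first, which matches the default layout of np.packbits:

```python
import numpy as np

def get_bit(packed, index):
    """Read bit `index` from a flat uint8 array laid out like np.packbits output."""
    byte, offset = divmod(index, 8)
    return bool(packed[byte] & np.uint8(1 << (7 - offset)))

def set_bit(packed, index, value):
    """Set or clear bit `index` in place."""
    byte, offset = divmod(index, 8)
    mask = 1 << (7 - offset)
    if value:
        packed[byte] |= np.uint8(mask)
    else:
        # 0xFF ^ mask has every bit set except the target bit
        packed[byte] &= np.uint8(0xFF ^ mask)

flags = np.array([True, False, True, True, False, False, False, True, True, False])
packed = np.packbits(flags.view(np.uint8))  # 10 bits are stored in 2 bytes

set_bit(packed, 0, False)
set_bit(packed, 1, True)

restored = np.unpackbits(packed)[:flags.size].astype(bool)
print(restored)
```

Like the pure-Python class, this works one bit at a time, so it is meant for occasional reads and writes on an already packed array rather than for bulk processing.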
Source: https://stackoverflow.com/questions/34511362/packing-boolean-array-needs-go-throught-int-numpy-1-8-2