Like many others, my situation is that I have a class which collects a large amount of data, and provides a method to return the data as a numpy array. (Additional data can con
I would use array.array() to do the data collection:
import array
a = array.array("d")  # "d" = C double, matches NumPy's float64
for i in range(100):
    a.append(i * 2)
Every time you want to do some calculation with the collected data, convert it to a numpy.ndarray with numpy.frombuffer:
import numpy as np
b = np.frombuffer(a, dtype=float)  # view on a's buffer, no copy
print(np.mean(b))
b will share data memory with a, so the conversion is very fast.
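To see that no copy is made, a small check (a sketch with made-up values) can write through one object and read through the other:

```python
import array
import numpy as np

a = array.array("d", [0.0, 2.0, 4.0])
b = np.frombuffer(a, dtype=float)  # view on a's buffer, no copy

a[1] = 10.0   # write through the array.array
b[2] = 20.0   # write through the NumPy view

print(b[1])   # 10.0
print(a[2])   # 20.0
```

Because b is only a window onto a's buffer, it is probably safest to treat it as short-lived: drop it before appending more data to a, and call np.frombuffer again afterwards.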
The resize method has two main problems. The first is that you return a reference to self._arr when the user calls get_data_as_array. A later resize will then do one of two things, depending on your implementation. It will either modify the array you've given your user, i.e. the user will take a.shape and the shape will change unpredictably, or it will corrupt that array by leaving it pointing at bad memory. You could solve that issue by always having get_data_as_array return self._arr.copy(), but that brings me to the second issue: resize is actually not very efficient. I believe that, in general, resize has to allocate new memory and do a copy every time it is called to grow an array. Plus, you now need to copy the array every time you want to return it to your user.
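The aliasing half of the problem can be shown without calling resize at all. In this sketch, internal stands in for self._arr, and returned_ref and returned_copy are what get_data_as_array could hand out; all three names are hypothetical:

```python
import numpy as np

internal = np.arange(3.0)        # stands in for self._arr

returned_ref = internal          # what `return self._arr` gives the caller
returned_copy = internal.copy()  # what `return self._arr.copy()` gives the caller

internal[0] = 99.0               # the class keeps modifying its own buffer

print(returned_ref[0])   # 99.0, the caller's array changed underneath them
print(returned_copy[0])  # 0.0, the copy is isolated but cost an O(N) copy
```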
Another approach would be to design your own dynamic array, that would look something like:
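import numpy as np

class DynamicArray(object):
    def __init__(self, scale_factor=2):
        self._data = np.empty(1)    # backing storage, may be larger than the data
        self.data = self._data[:0]  # view of the valid elements only
        self.scale_factor = scale_factor

    def append(self, values):
        old_size = len(self.data)
        total_size = old_size + len(values)
        total_storage = len(self._data)
        if total_storage < total_size:
            # Grow geometrically so that reallocation is rare.
            while total_storage < total_size:
                # np.empty needs an integer size, so convert the float from np.ceil.
                total_storage = int(np.ceil(total_storage * self.scale_factor))
            self._data = np.empty(total_storage)
            self._data[:old_size] = self.data  # copy existing values into new storage
        self._data[old_size:total_size] = values
        self.data = self._data[:total_size]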
This should be very fast because, with scale_factor = 2, you only need to grow the array O(log N) times, and you use at most 2*N-1 storage, where N is the maximum size of the array. Other than growing the array, you're just making views of _data, which doesn't involve any copying and should be constant time.
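One way to check the growth claim is to count how often the storage is reallocated. The sketch below repeats the same growth scheme in a self-contained class; the name CountingDynamicArray and its reallocations counter are added purely for illustration:

```python
import numpy as np

class CountingDynamicArray(object):
    """Same growth scheme as above, plus a hypothetical reallocation counter."""

    def __init__(self, scale_factor=2):
        self._data = np.empty(1)
        self.data = self._data[:0]
        self.scale_factor = scale_factor
        self.reallocations = 0

    def append(self, values):
        old_size = len(self.data)
        total_size = old_size + len(values)
        total_storage = len(self._data)
        if total_storage < total_size:
            while total_storage < total_size:
                total_storage = int(np.ceil(total_storage * self.scale_factor))
            new_data = np.empty(total_storage)
            new_data[:old_size] = self.data  # the only copy, done on growth
            self._data = new_data
            self.reallocations += 1
        self._data[old_size:total_size] = values
        self.data = self._data[:total_size]

arr = CountingDynamicArray()
for i in range(1000):
    arr.append([i * 2.0])   # one value at a time, the worst case for growth

print(len(arr.data))        # 1000
print(len(arr._data))       # 1024
print(arr.reallocations)    # 10
```

Appending 1000 values one by one triggers only 10 reallocations here, and the backing storage (1024 slots) stays under twice the number of stored values.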
Hope this is useful.