Question
I have the following workflow, whereby I append data to an empty pandas Series object. (This empty array could also be a NumPy array, or even a basic list.)
in_memory_array = pd.Series([])
for df in list_of_pandas_dataframes:
    new = df.apply(lambda row: compute_something(row), axis=1)  # new is a pandas.Series
    in_memory_array = in_memory_array.append(new)
My problem is that the resulting array, in_memory_array, becomes too large for RAM. I don't need to keep all objects in memory for this computation.
I think one option is to pickle the objects to disk once the array gets too big for RAM, e.g.
# N = some size in bytes too large for RAM
if sys.getsizeof(in_memory_array) > N:
    with open('mypickle.pickle', 'wb') as f:
        pickle.dump(in_memory_array, f)
Otherwise, is there an out-of-core solution? The best case scenario would be to create some cap such that the object cannot grow larger than X GB in RAM.
Answer 1:
Check out the Python library wendelin.core (https://pypi.org/project/wendelin.core/). It allows you to work with arrays bigger than RAM and local disk.
Answer 2:
You could preprocess all of your dataframes as NumPy arrays and save them to one or more npz files, or to compressed npz files if disk space is a concern, then access them as needed. I have limited experience with npz files, but I have not found a way to append to them, so if all of your data does not fit in RAM, you would have to create multiple npz files. When you load an npz file, np.load returns an object that lists the array names without loading the arrays into RAM until you access them. (Note that, as far as I can tell, the mmap_mode argument only takes effect for plain .npy files; for npz archives it is this lazy per-array loading that keeps memory use down, and each array is read fully into RAM when accessed.) As an example:
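import numpy as np

def makeNPZ():
    # Build a few sample arrays and store them under named keys
    z = np.zeros(100000)
    o = np.ones(100000)
    e = np.eye(100)
    dct = {'zero': z, 'one': o, 'eye': e}
    np.savez_compressed('TempZip.npz', **dct)

def useNPZ():
    # Returns a lazy NpzFile; arrays are read only when indexed by name
    return np.load('TempZip.npz', mmap_mode='r+')
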
makeNPZ()
memoryMap = useNPZ()
memoryMap.files
Out[6]: ['zero', 'one', 'eye']
memoryMap['one']
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])
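Since npz files cannot be appended to, one way to sketch the "multiple files" idea is to write each chunk of results to its own plain .npy file and read the chunks back one at a time with mmap_mode='r'. This is only an illustration under assumptions, not a definitive implementation: spill_chunks, iter_chunks, fake_results, and the chunk_00000.npy naming are hypothetical names made up here, and fake_results stands in for the per-DataFrame results from the question.

```python
import os
import tempfile

import numpy as np


def spill_chunks(chunks, out_dir):
    """Save each chunk to its own .npy file and return the file paths.

    Only one chunk is held in memory at a time.
    """
    paths = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(out_dir, f"chunk_{i:05d}.npy")
        np.save(path, np.asarray(chunk))
        paths.append(path)
    return paths


def iter_chunks(paths):
    """Yield each saved chunk as a read-only memory map."""
    for path in paths:
        # For .npy files, mmap_mode makes np.load return a np.memmap
        yield np.load(path, mmap_mode="r")


def fake_results(n_chunks=3, size=4):
    """Hypothetical stand-in for df.apply(...) results, one array per DataFrame."""
    for i in range(n_chunks):
        yield np.full(size, float(i))


out_dir = tempfile.mkdtemp()
paths = spill_chunks(fake_results(), out_dir)

# Reduce over the chunks without concatenating them in RAM
total = sum(float(chunk.sum()) for chunk in iter_chunks(paths))
```

With this layout the in-memory footprint is bounded by the largest single chunk rather than by the full result, at the cost of managing several files on disk.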
Source: https://stackoverflow.com/questions/60871793/numpy-array-larger-than-ram-write-to-disk-or-out-of-core-solution