Numpy array larger than RAM: write to disk or out-of-core solution?

Submitted by 徘徊边缘 on 2021-01-01 04:50:37

Question


I have the following workflow, whereby I append data to an empty pandas Series object. (This empty array could also be a NumPy array, or even a basic list.)

in_memory_array = pd.Series([], dtype=float)

for df in list_of_pandas_dataframes:
    new = df.apply(compute_something, axis=1)  # new is a pandas.Series
    # Note: Series.append is deprecated in recent pandas; pd.concat is the replacement.
    in_memory_array = in_memory_array.append(new)

My problem is that the resulting array in_memory_array becomes too large for RAM. I don't need to keep all objects in memory for this computation.

I think my options are somehow pickling objects to disk once the array gets too big for RAM, e.g.

import pickle
import sys

# N = some size in bytes too large for RAM
if sys.getsizeof(in_memory_array) > N:
    with open('mypickle.pickle', 'wb') as f:
        pickle.dump(in_memory_array, f)

Otherwise, is there an out-of-core solution? The best case scenario would be to create some cap such that the object cannot grow larger than X GB in RAM.
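One out-of-core pattern that sidesteps the size check entirely is to flush each chunk's result to its own `.npy` file as soon as it is computed, so RAM only ever holds one chunk. This is a minimal sketch; `compute_something` and `list_of_pandas_dataframes` here are small stand-ins for the question's objects:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the question's compute_something; any per-row function works.
def compute_something(row):
    return row.sum()

outdir = tempfile.mkdtemp()

# Stand-in for the question's list of dataframes.
list_of_pandas_dataframes = [
    pd.DataFrame(np.arange(6).reshape(3, 2)) for _ in range(4)
]

# Instead of appending to one in-RAM Series, write each chunk's result
# to its own .npy file; only one chunk is ever held in memory.
paths = []
for i, df in enumerate(list_of_pandas_dataframes):
    new = df.apply(compute_something, axis=1)
    path = os.path.join(outdir, f"chunk_{i:05d}.npy")
    np.save(path, new.to_numpy())
    paths.append(path)

# Read the results back lazily with memory mapping; no array data is
# loaded into RAM until a chunk is actually indexed.
chunks = [np.load(p, mmap_mode="r") for p in paths]
total_rows = sum(c.shape[0] for c in chunks)
```

The trade-off is that the combined result is a list of memory-mapped chunks rather than one contiguous Series, which is fine when downstream processing can also work chunk by chunk.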


Answer 1:


Check out this Python library: https://pypi.org/project/wendelin.core/ — it allows you to work with arrays bigger than RAM and local disk.




Answer 2:


You could preprocess each of your dataframes into a NumPy array and save them all to one or more npz files, or to compressed npz files if disk space is a concern, then access them as needed. (In my limited experience with npz files I have not found a way to append to an existing archive, so if all of your data does not fit in RAM you would have to create multiple npz files.) When you open an npz file with np.load, you get a lazy NpzFile object that exposes the array names without loading any array into RAM until you index it. As an example:

import numpy as np

def makeNPZ():
    z = np.zeros(100000)
    o = np.ones(100000)
    e = np.eye(100)

    # Each keyword becomes a named array inside the archive.
    dct = {'zero': z, 'one': o, 'eye': e}
    np.savez_compressed('TempZip.npz', **dct)

def useNPZ():
    # Returns a lazy NpzFile; arrays are read from disk only when indexed.
    return np.load('TempZip.npz')

makeNPZ()
memoryMap = useNPZ()

memoryMap.files
# Out: ['zero', 'one', 'eye']

memoryMap['one']
# Out: array([ 1.,  1.,  1., ...,  1.,  1.,  1.])
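If the questioner's hard cap on in-RAM size is the priority, np.memmap gives it directly: it preallocates a fixed-size array backed by a file on disk, and the OS pages data in and out as slices are touched. A minimal sketch, assuming the total result length is known up front (the file path and doubling function are illustrative only):

```python
import os
import tempfile

import numpy as np

# Preallocate a disk-backed array of the full result length once;
# RAM holds only the pages currently being read or written.
path = os.path.join(tempfile.mkdtemp(), "results.dat")
n_total = 1000
out = np.memmap(path, dtype="float64", mode="w+", shape=(n_total,))

# Fill it chunk by chunk, as if each chunk came from one dataframe.
offset = 0
for chunk in np.array_split(np.arange(n_total, dtype="float64"), 10):
    out[offset:offset + len(chunk)] = chunk * 2.0
    offset += len(chunk)
out.flush()  # push written pages to disk

# Later, re-open read-only without loading the whole file into RAM.
ro = np.memmap(path, dtype="float64", mode="r", shape=(n_total,))
```

Unlike the npz route, this yields one contiguous on-disk array that can be sliced like a normal ndarray, at the cost of having to fix the dtype and total shape before the first chunk is written.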


Source: https://stackoverflow.com/questions/60871793/numpy-array-larger-than-ram-write-to-disk-or-out-of-core-solution
