numpy memmap memory usage - want to iterate once


Question


Let's say I have a big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:

import numpy as np

A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000, 162))

Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row is accessed exactly once:

p = some_permutation_of_0_to_2999999()

I would like to do something like this:

start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)

As this process goes on, my computer becomes slower and slower, and my RAM and virtual-memory usage explodes.

Is there some way to force np.memmap to use at most a certain amount of memory? (I know I won't need more than the number of rows I plan to read at a time, and caching won't really help me since I'm accessing each row exactly once.)

Alternatively, is there some other (generator-like) way to iterate over an np array in a custom order? I could write it manually using file.seek, but that turns out to be much slower than the np.memmap implementation.
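Roughly, I'm imagining an interface like this (a hypothetical sketch of what I'm after; as written, the fancy indexing has exactly the same caching problem):

def iter_rows(A, p, batch_size):
    # hypothetical generator: yield batches of A's rows in the order given
    # by the permutation p, ideally keeping only ~batch_size rows in RAM
    for start in range(0, len(p), batch_size):
        yield A[p[start:start+batch_size], :]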

do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.

Thanks.


Answer 1:


This is an issue I've been trying to deal with for a while. I work with large image datasets, and numpy.memmap offers a convenient solution for working with them.

However, as you've pointed out, if I need to access each frame (or each row, in your case) to perform some operation, RAM usage will eventually max out.

Fortunately, I recently found a solution that will allow you to iterate through the entire memmap array while capping the RAM usage.

Solution:

import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # preallocate a buffer to hold one chunk;
    # its size determines how much RAM is used
    holder = np.zeros([chunk_size, 800, 800], dtype='uint16')

    # iterate through the input in chunks, operate, and write to output
    for i in range(0, input.shape[0], chunk_size):
        n = min(chunk_size, input.shape[0] - i)  # the last chunk may be short
        holder[:n] = input[i:i+n]    # read a chunk from the input memmap
        holder[:n] += 5              # perform some operation
        output[i:i+n] = holder[:n]   # write the chunk to the output memmap

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5

Timing Results:

In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop

In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop

The array on disk is ~12 GB. The iterate_efficiently function keeps memory usage at 1.28 GB, whereas iterate_inefficiently eventually reaches 12 GB in RAM.

This was tested on macOS.




Answer 2:


I've been experimenting with this problem for a couple of days now, and it appears there are two ways to control memory consumption with np.memmap. The first is reliable; the second requires some testing and is OS-dependent.

Option 1 - reconstruct the memory map with each read / write:

import numpy as np

def MoveMMapNPArray(data, output_filename):
    CHUNK_SIZE = 4096
    for idx in range(0, data.shape[1], CHUNK_SIZE):
        # recreate both maps on every iteration so the old ones are discarded
        x = np.memmap(data.filename, dtype=data.dtype, mode='r', shape=data.shape, order='F')
        y = np.memmap(output_filename, dtype=data.dtype, mode='r+', shape=data.shape, order='F')
        end = min(idx + CHUNK_SIZE, data.shape[1])
        y[:, idx:end] = x[:, idx:end]

Here data is of type np.memmap. Discarding and recreating the memmap object on each iteration keeps the array from accumulating in memory, and keeps memory consumption very low as long as the chunk size is small. It likely introduces some CPU overhead, but this was found to be small on my setup (macOS).
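A minimal usage sketch (the file names and shape here are made up; note that the destination file must already exist on disk, since the function opens it with mode='r+'):

import numpy as np

# hypothetical setup: create the source and pre-size the destination on disk
src = np.memmap('src.bin', dtype='float32', mode='w+', shape=(162, 100000), order='F')
dst = np.memmap('dst.bin', dtype='float32', mode='w+', shape=(162, 100000), order='F')
del dst  # only needed to create the file; MoveMMapNPArray reopens it in 'r+' mode

MoveMMapNPArray(src, 'dst.bin')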

Option 2 - construct the mmap buffer yourself and provide memory advice

If you look at the np.memmap source code here, you can see that it is relatively simple to create your own memory-mapped numpy array. Specifically, it comes down to this snippet:

mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap_np_array = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm, offset=array_offset, order=order)

Note that this Python mmap instance is stored as the np.memmap's private _mmap attribute.
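A self-contained version of that snippet might look like this (a sketch only; 'data.bin' is a hypothetical file that must already exist and hold at least 3000000 * 162 * 4 bytes):

import mmap
import numpy as np

with open('data.bin', 'rb') as fid:
    mm = mmap.mmap(fid.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file

# zero-copy, read-only ndarray view over the mapped buffer
A = np.ndarray((3000000, 162), dtype='float32', buffer=mm)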

With access to the Python mmap object, and Python 3.8 or later, you can use its madvise method, described here.

This allows you to advise the OS to free memory where possible. The various madvise constants are described here for Linux, with some generic cross-platform options also specified.

The MADV_DONTDUMP constant looks promising, but I haven't tested memory consumption with it the way I have for option 1.
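As an untested sketch of how this could look (madvise requires Python 3.8+, MADV_DONTNEED is a POSIX constant that is not available on every platform, and _mmap is a private numpy attribute that may change between versions; 'input' is the file created in Answer 1):

import mmap
import numpy as np

A = np.memmap('input', dtype='uint16', mode='r', shape=(10000, 800, 800))

CHUNK = 1000
for i in range(0, A.shape[0], CHUNK):
    chunk = np.array(A[i:i+CHUNK])  # copy the rows out of the map into RAM
    chunk += 5                      # stand-in for the real per-chunk operation
    # advise the kernel that the cached pages for this map are no longer needed
    A._mmap.madvise(mmap.MADV_DONTNEED)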



Source: https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once
