How can I efficiently read and write files that are too large to fit in memory?

北荒 2021-02-12 15:37

I am trying to calculate the cosine similarity of 100,000 vectors, and each of these vectors has 200,000 dimensions.

From reading other questions I know that memmap, PyT

2 Answers
  • 2021-02-12 15:41

    Memory maps are exactly what the name says: mappings of (virtual) disk sectors into memory pages. The memory is managed by the operating system on demand. If there is enough memory, the system keeps parts of the files in memory, perhaps filling up the whole memory; if there is not enough left, it may discard pages read from the file or swap them out to swap space. Normally you can rely on the OS being as efficient as possible.
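
    As a minimal sketch of what this looks like from NumPy (the filename is made up; the shape matches the sizes in the question):

        import numpy as np

        # Open an existing binary file as a read-only memmap.  Nothing is
        # loaded up front; the OS pages data in as you slice the array.
        arr = np.memmap('vectors.dat', dtype=np.float32, mode='r',
                        shape=(100000, 200000))
        row = arr[0]  # only the pages backing this row are read from disk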

  • 2021-02-12 15:50

    In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memmapped arrays are handled at the level of the OS - data to be written is usually held in a temporary buffer, and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.

    I'd advise against calling flush on every iteration, since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're only writing individual float values one at a time.


    In terms of IO and CPU efficiency, operating on a single line at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many lines at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).

    Here's an example showing how much you can speed up operations on memmapped arrays by processing them in appropriately sized chunks:
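
    The sketch below compares a row-by-row loop against chunked processing on a memmapped array. The sizes, filenames, and chunk size are illustrative (the real arrays from the question, 100,000 x 200,000 float32, would be roughly 80 GB), and the row-norm computation just stands in for whatever per-row work you're doing:

        import time
        import numpy as np

        nrow, ncol = 20000, 1000   # shrunk for the demo
        chunk_rows = 2000

        # Memmapped input filled with random data (for demonstration only).
        x = np.memmap('x.dat', dtype=np.float32, mode='w+', shape=(nrow, ncol))
        x[:] = np.random.randn(nrow, ncol)

        out = np.memmap('norms.dat', dtype=np.float32, mode='w+', shape=(nrow,))

        # 1) one row at a time
        t0 = time.perf_counter()
        for i in range(nrow):
            out[i] = np.sqrt(np.dot(x[i], x[i]))
        print('row-by-row: %.3f s' % (time.perf_counter() - t0))

        # 2) one contiguous block of rows at a time
        t0 = time.perf_counter()
        for start in range(0, nrow, chunk_rows):
            stop = min(start + chunk_rows, nrow)
            block = x[start:stop]                      # contiguous read from disk
            out[start:stop] = np.sqrt((block * block).sum(axis=1))
        print('chunked:    %.3f s' % (time.perf_counter() - t0))

        out.flush()  # a single flush at the end is enough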

    Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat was F-contiguous (column-major) on disk, or if you were accessing it by row.
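
    If you do need column-wise access and you control how the file is created, one option is to pass order='F' when building the memmap, so that each column is contiguous on disk. A minimal sketch (filename and shape are illustrative):

        import numpy as np

        # Column-major layout on disk: wmat[:, j] now maps to one contiguous
        # stretch of the file instead of one scattered read per row.
        wmat = np.memmap('wmat_f.dat', dtype=np.float32, mode='w+',
                         shape=(1000, 500), order='F')
        col = wmat[:, 10]   # efficient, contiguous read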

    The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
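
    For completeness, here is roughly what the chunked pattern looks like with h5py (file and dataset names are made up; with HDF5, data only enters memory when you explicitly slice it, so the chunking loop is your responsibility):

        import h5py
        import numpy as np

        nrow, ncol, chunk_rows = 20000, 1000, 2000

        # Write the data in blocks; 'chunks' sets the on-disk chunk shape.
        with h5py.File('data.h5', 'w') as f:
            dset = f.create_dataset('x', shape=(nrow, ncol), dtype='float32',
                                    chunks=(chunk_rows, ncol))
            for start in range(0, nrow, chunk_rows):
                dset[start:start + chunk_rows] = np.random.randn(chunk_rows, ncol)

        # Read it back one block at a time and reduce each block.
        with h5py.File('data.h5', 'r') as f:
            dset = f['x']
            norms = np.empty(nrow, dtype='float32')
            for start in range(0, nrow, chunk_rows):
                block = dset[start:start + chunk_rows]   # explicit read into memory
                norms[start:start + chunk_rows] = np.sqrt((block * block).sum(axis=1))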
