Numpy efficient big matrix multiplication

温柔的废话 2021-02-03 14:02

To store a big matrix on disk I use numpy.memmap.

Here is some sample code to test big matrix multiplication:

import numpy as np
import time

rows = 10000  # it can be made much larger
cols = 1000

# store the operands on disk with numpy.memmap instead of in RAM
a = np.memmap('a.dat', dtype='float64', mode='w+', shape=(rows, cols))
b = np.memmap('b.dat', dtype='float64', mode='w+', shape=(cols, rows))

t0 = time.time()
res = np.dot(a, b)  # naive multiplication pulls everything through memory
print('np.dot took %.2f s' % (time.time() - t0))
2 Answers
  • 2021-02-03 14:38

    Dask.array provides a numpy interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily do out-of-core matrix multiplies and other simple-ish numpy operations.

    Blocked linear algebra is harder and you might want to check out some of the academic work on this topic. Dask does support QR and SVD factorizations on tall-and-skinny matrices.

    Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit the disk in unpleasant ways.
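
    For example, a minimal sketch with dask.array, assuming the operands already live on disk as memmaps (file names, shapes, and chunk sizes below are placeholders):

        import numpy as np
        import dask.array as da

        # wrap the on-disk operands in dask arrays split into blocks
        a = np.memmap('a.dat', dtype='float64', mode='r', shape=(10000, 1000))
        b = np.memmap('b.dat', dtype='float64', mode='r', shape=(1000, 10000))
        x = da.from_array(a, chunks=(1000, 1000))
        y = da.from_array(b, chunks=(1000, 1000))

        # blocked matrix multiply; nothing runs until store()/compute()
        z = x.dot(y)

        # stream the result back to disk block by block
        out = np.memmap('c.dat', dtype='float64', mode='w+', shape=(10000, 10000))
        da.store(z, out)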

  • 2021-02-03 14:46

    Consider using NumExpr for your processing: https://github.com/pydata/numexpr

    ... internally, NumExpr employs its own vectorized virtual machine that is designed around a chunked-read strategy, in order to efficiently operate on optimally-sized blocks of data in memory. It can handily beat naïve NumPy operations if tuned properly.
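
    For instance, a minimal sketch of the kind of call NumExpr accelerates (it operates on elementwise expressions; the arrays below are just placeholders):

        import numpy as np
        import numexpr as ne

        a = np.random.rand(10_000_000)
        b = np.random.rand(10_000_000)

        # evaluated in cache-sized chunks by NumExpr's virtual machine,
        # avoiding the large temporaries plain NumPy would allocate
        c = ne.evaluate('2*a + 3*b**2')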

    NumExpr may cover #2 in your breakdown of the issue. If you address #1 by using a streamable binary format, you can then take a chunked-read approach when loading your data files – like so:

        with open('path/to/your-data.bin', 'rb') as binary:
            while True:
                chunk = binary.read(4096)  # or whatever block size suits you
                if not chunk:
                    break
                # ... hand the chunk off to your processing step here

    If that is too low-level for you, I would recommend you look at the HDF5 library and format: http://www.h5py.org – it’s the best solution for the binary serialization of NumPy-based structures that I know of. The h5py module supports compression, chunked reading, dtypes, metadata… you name it.
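
    A rough sketch of writing a big matrix to a chunked, compressed HDF5 dataset (dataset name, shape, and chunk size below are placeholders):

        import numpy as np
        import h5py

        rows, cols = 10000, 1000
        with h5py.File('data.h5', 'w') as f:
            # chunked + compressed dataset; h5py reads/writes it block by block
            dset = f.create_dataset('matrix', shape=(rows, cols), dtype='float64',
                                    chunks=(1000, cols), compression='gzip')
            for i in range(0, rows, 1000):
                dset[i:i + 1000, :] = np.random.rand(1000, cols)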

    Good luck!
