To store a big matrix on disk I use numpy.memmap.
Here is some sample code to test big-matrix multiplication:

import numpy as np
import time
rows = 10000  # dimension of the square test matrices
A = np.memmap('A.dat', dtype=np.float64, mode='w+', shape=(rows, rows))
B = np.memmap('B.dat', dtype=np.float64, mode='w+', shape=(rows, rows))
t0 = time.time()
C = np.dot(A, B)  # note: np.dot materializes the full result in RAM
print('multiply took %.1f s' % (time.time() - t0))
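A plain np.dot over memmaps still builds the whole result in memory and streams the inputs in a cache-unfriendly order. A minimal sketch of a blocked multiply that keeps only a few small tiles in RAM at a time (filenames and sizes are illustrative):

```python
import numpy as np

n, bs = 400, 100  # matrix size and block size (illustrative)

# Memory-mapped operands and output; mode='w+' creates zero-filled files.
A = np.memmap('A_blocked.dat', dtype=np.float64, mode='w+', shape=(n, n))
B = np.memmap('B_blocked.dat', dtype=np.float64, mode='w+', shape=(n, n))
C = np.memmap('C_blocked.dat', dtype=np.float64, mode='w+', shape=(n, n))
A[:] = np.random.rand(n, n)
B[:] = np.random.rand(n, n)

# Multiply tile by tile so only three bs-by-bs blocks live in RAM at once.
for i in range(0, n, bs):
    for j in range(0, n, bs):
        acc = np.zeros((bs, bs))
        for k in range(0, n, bs):
            acc += np.dot(A[i:i+bs, k:k+bs], B[k:k+bs, j:j+bs])
        C[i:i+bs, j:j+bs] = acc
C.flush()  # push the result back to disk
```

Pick bs so three bs-by-bs float64 tiles fit comfortably in memory; larger tiles mean fewer disk passes.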
Dask.array provides a numpy interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily do out-of-core matrix multiplies and other simple-ish numpy operations.
Blocked linear algebra is harder and you might want to check out some of the academic work on this topic. Dask does support QR and SVD factorizations on tall-and-skinny matrices.
Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit the disk in unpleasant ways.
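To make that concrete, a sketch of a blocked multiply and a tall-and-skinny QR with dask.array (assumes dask is installed; shapes and chunk sizes are illustrative):

```python
import numpy as np
import dask.array as da

# Two chunked 2000x2000 arrays; each 500x500 block is a separate task.
x = da.random.random((2000, 2000), chunks=(500, 500))
y = da.random.random((2000, 2000), chunks=(500, 500))

z = x.dot(y)          # lazy: only builds a blocked task graph
result = z.compute()  # runs the graph, block by block

# Tall-and-skinny QR, as mentioned above (columns must fit in one chunk)
tall = da.random.random((10000, 100), chunks=(1000, 100))
q, r = da.linalg.qr(tall)
```

For truly on-disk inputs and outputs you would typically back the dask arrays with HDF5 datasets rather than da.random, but the blocked execution is the same.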
Consider using NumExpr for your processing: https://github.com/pydata/numexpr
... internally, NumExpr employs its own vectorized virtual machine that is designed around a chunked-read strategy, in order to efficiently operate on optimally-sized blocks of data in memory. It can handily beat naïve NumPy operations if tuned properly.
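For example, an arithmetic expression that NumPy would evaluate in several passes with temporaries, but NumExpr streams through in cache-sized chunks (array sizes are illustrative; assumes numexpr is installed):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# NumPy would allocate temporaries for 2*a, b*b, and 3*b*b separately;
# NumExpr compiles the whole expression and evaluates it chunk by chunk.
result = ne.evaluate('2*a + 3*b*b')
```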
NumExpr may cover #2 in your breakdown of the issue. If you address #1 by using a streamable binary format, you can then use a chunked-read approach when loading your data files – like so:
with open('path/to/your-data.bin', 'rb') as binary:
    while True:
        chunk = binary.read(4096)  # or what have you
        if not chunk:
            break
        # ... parse and process `chunk` here, e.g. with np.frombuffer ...
If that is too low-level for you, I would recommend you look at the HDF5 library and format: http://www.h5py.org – it’s the best solution for the binary serialization of NumPy-based structures that I know of. The h5py
module supports compression, chunked reading, dtypes, metadata… you name it.
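A minimal sketch of chunked, compressed storage with h5py (assumes h5py is installed; the file name, dataset name, and chunk shape are illustrative):

```python
import numpy as np
import h5py

# Write a large matrix in compressed, chunked form, one block at a time.
with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('matrix', shape=(4000, 1000), dtype='float64',
                            chunks=(1000, 1000), compression='gzip')
    for i in range(0, 4000, 1000):
        dset[i:i+1000] = np.random.rand(1000, 1000)

# Read it back block-wise; only the requested slice is loaded into RAM.
with h5py.File('data.h5', 'r') as f:
    dset = f['matrix']
    total = sum(dset[i:i+1000].sum() for i in range(0, 4000, 1000))
```

Aligning your read pattern with the dataset's chunk shape keeps the I/O sequential and the decompression cheap.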
Good luck!