NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but it begins to struggle with anything much larger.
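For a rough sense of scale, a dense 10000 x 10000 matrix of float64 values already takes 800 MB, so it's easy to see why much larger dense matrices become a problem:

import numpy as np

a = np.zeros((10000, 10000))  # float64 by default
print(a.nbytes / 1e9)         # 0.8 -- i.e. 800 MB for a single dense matrix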
Usually when we deal with large matrices we implement them as sparse matrices.
I don't know whether NumPy itself supports sparse matrices, but I found this instead.
It's a bit alpha, but http://blaze.pydata.org/ seems to be working on solving this.
To handle sparse matrices, you need the scipy package that sits on top of numpy -- see the scipy.sparse documentation for more details about the sparse-matrix options scipy gives you.
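As a quick illustration of what that looks like (a minimal sketch; the matrix size and values here are arbitrary):

import numpy as np
from scipy import sparse

# lil_matrix is efficient for building a matrix element by element.
m = sparse.lil_matrix((100000, 100000), dtype=np.float64)
m[0, 0] = 1.0
m[42, 99999] = 3.5

# Convert to CSR for fast arithmetic; only the non-zeros are stored.
csr = m.tocsr()
v = np.ones(100000)
print(csr.nnz, csr.dot(v)[0])

A dense version of this matrix would need 80 GB, while the sparse version stores just the two non-zero entries.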
PyTables and NumPy are the way to go.
PyTables will store the data on disk in HDF format, with optional compression. My datasets often get 10x compression, which is handy when dealing with tens or hundreds of millions of rows. It's also very fast; my 5-year-old laptop can crunch through data doing SQL-like GROUP BY aggregation at 1,000,000 rows/second. Not bad for a Python-based solution!
Accessing the data as a NumPy recarray again is as simple as:
data = table[row_from:row_to]
The HDF library takes care of reading in the relevant chunks of data and converting them to NumPy.
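A minimal sketch of that workflow (the file name, table name, and two-column schema are made up for the example; open_file, create_table, and slicing are standard PyTables calls):

import numpy as np
import tables

# Describe a hypothetical two-column table.
class Row(tables.IsDescription):
    x = tables.Float64Col()
    y = tables.Float64Col()

# Write a compressed table to an HDF5 file on disk.
with tables.open_file("data.h5", mode="w") as h5:
    filters = tables.Filters(complevel=5, complib="blosc")
    table = h5.create_table("/", "mytable", Row, filters=filters)
    table.append(np.zeros(1000, dtype=[("x", "f8"), ("y", "f8")]))

# Read back just one slice; HDF5 loads only the relevant chunks.
with tables.open_file("data.h5", mode="r") as h5:
    data = h5.root.mytable[100:200]  # a NumPy structured array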
Sometimes one simple solution is to use a custom type for your matrix items. Based on the range of numbers you need, you can specify a dtype manually, choosing a smaller one for your items. Since NumPy picks the largest type by default, this can be a helpful idea in many cases. Here is an example:
In [70]: a = np.arange(5)
In [71]: a[0].dtype
Out[71]: dtype('int64')
In [72]: a.nbytes
Out[72]: 40
In [73]: a = np.arange(0, 2, 0.5)
In [74]: a[0].dtype
Out[74]: dtype('float64')
In [75]: a.nbytes
Out[75]: 32
And with a custom type:
In [80]: a = np.arange(5, dtype=np.int8)
In [81]: a.nbytes
Out[81]: 5
In [76]: a = np.arange(0, 2, 0.5, dtype=np.float16)
In [78]: a.nbytes
Out[78]: 8
Are you asking how to handle a 2,500,000,000-element matrix without terabytes of RAM?
The way to handle 2 billion items without 8 billion bytes of RAM is by not keeping the matrix in memory.
That means using much more sophisticated algorithms that fetch the matrix from the file system in pieces.
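One standard way to do that with NumPy itself is numpy.memmap, which keeps the array in a file and pages pieces in on demand. A minimal sketch (the file name and shape are chosen just for illustration):

import numpy as np

# A 50000 x 50000 float64 matrix backed by a ~20 GB file on disk.
m = np.memmap("big_matrix.dat", dtype=np.float64,
              mode="w+", shape=(50000, 50000))

# Work on one block of rows at a time; only the touched pages
# need to be resident in RAM.
m[0:1000, :] = 1.0
print(m[0:1000, :].sum())
m.flush()  # push pending writes back to the file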