Very large matrices using Python and NumPy

难免孤独 2020-11-22 13:51

NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to strug

11 answers
  • 2020-11-22 14:16

    Usually, when we deal with large matrices, we implement them as sparse matrices.

    I don't know if numpy supports sparse matrices but I found this instead.

  • 2020-11-22 14:22

    It's a bit alpha, but http://blaze.pydata.org/ seems to be working on solving this.

  • 2020-11-22 14:24

    To handle sparse matrices, you need the scipy package that sits on top of numpy -- see here for more details about the sparse-matrix options that scipy gives you.
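
    As a rough illustration of the point above, here is a minimal sketch using scipy.sparse.csr_matrix. The matrix size and the handful of nonzero entries are made up for the example; the idea is only that storage scales with the number of nonzeros rather than with rows x columns.

```python
import numpy as np
from scipy import sparse

# Hypothetical 10000 x 10000 matrix with only four nonzero entries.
n = 10_000
rows = np.array([0, 1, 2, 9_999])
cols = np.array([0, 5, 2, 9_999])
vals = np.array([1.0, 2.0, 3.0, 4.0])

# Build a CSR matrix from (value, (row, col)) triplets.
m = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))

# Memory a dense float64 array of the same shape would need.
dense_bytes = n * n * 8

# Memory actually held by the CSR arrays (values, column indices, row pointers).
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Sparse matrices still support the usual linear algebra, e.g. matrix-vector products.
v = np.ones(n)
result = m @ v

print(f"dense: {dense_bytes / 1e6:.0f} MB, sparse: {sparse_bytes / 1e3:.1f} kB")
```

    This only pays off when most entries are zero; a matrix that is mostly nonzero will take more memory in a sparse format than in a dense one.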

  • 2020-11-22 14:28

    PyTables and NumPy are the way to go.

    PyTables will store the data on disk in HDF format, with optional compression. My datasets often get 10x compression, which is handy when dealing with tens or hundreds of millions of rows. It's also very fast; my 5-year-old laptop can crunch through data doing SQL-like GROUP BY aggregation at 1,000,000 rows/second. Not bad for a Python-based solution!

    Accessing the data as a NumPy recarray again is as simple as:

    data = table[row_from:row_to]
    

    The HDF library takes care of reading in the relevant chunks of data and converting to NumPy.
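
    For context, a fuller sketch of how such a table might be created and then sliced is below. The file name, the column layout (id, value), and the compression settings are assumptions picked for illustration, not something from the setup described above.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical row layout: an integer id and a float value per row.
row_dtype = np.dtype([("id", np.int64), ("value", np.float64)])
rows = np.zeros(1000, dtype=row_dtype)
rows["id"] = np.arange(1000)
rows["value"] = rows["id"] * 0.5

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Optional compression; zlib level 5 is an arbitrary choice here.
filters = tables.Filters(complevel=5, complib="zlib")

# Write the rows to an on-disk HDF5 table.
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "data", description=row_dtype, filters=filters)
    table.append(rows)
    table.flush()

# Read back only a slice; just the needed chunks are loaded from disk.
with tables.open_file(path, mode="r") as h5:
    table = h5.root.data
    row_from, row_to = 100, 110
    data = table[row_from:row_to]

print(data["id"])
```

    The slice comes back as a NumPy structured array, so field access such as data["value"] works as usual.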

  • 2020-11-22 14:30

    Sometimes a simple solution is to use a custom dtype for your matrix items. Based on the range of numbers you need, you can explicitly pick a smaller dtype. By default NumPy picks a wide type (typically int64 or float64 on 64-bit platforms), so this can save a lot of memory in many cases. Here is an example:

    In [70]: a = np.arange(5)
    
    In [71]: a[0].dtype
    Out[71]: dtype('int64')
    
    In [72]: a.nbytes
    Out[72]: 40
    
    In [73]: a = np.arange(0, 2, 0.5)
    
    In [74]: a[0].dtype
    Out[74]: dtype('float64')
    
    In [75]: a.nbytes
    Out[75]: 32
    

    And with custom type:

    In [80]: a = np.arange(5, dtype=np.int8)
    
    In [81]: a.nbytes
    Out[81]: 5
    
    In [76]: a = np.arange(0, 2, 0.5, dtype=np.float16)
    
    In [78]: a.nbytes
    Out[78]: 8
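
    To put that in terms of the 10000 x 10000 matrix from the question, here is a small sketch that estimates the footprint per dtype and checks the value range before downcasting. The sample values are made up; the range check is there because values outside the target dtype's range would not survive the cast.

```python
import numpy as np

# Footprint of a 10000 x 10000 matrix for a few dtypes, computed without allocating it.
shape = (10_000, 10_000)
n_items = shape[0] * shape[1]
dtypes = (np.float64, np.float32, np.float16, np.int8)
sizes = {np.dtype(dt).name: n_items * np.dtype(dt).itemsize for dt in dtypes}

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes / 1e6:.0f} MB")

# Only downcast when every value fits in the smaller type.
values = np.array([0, 100, 127])   # hypothetical data
info = np.iinfo(np.int8)           # int8 holds -128..127
fits = values.min() >= info.min and values.max() <= info.max
small = values.astype(np.int8) if fits else values
```

    For floats the trade-off is precision rather than range alone: float16 keeps only about three significant decimal digits.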
    
  • 2020-11-22 14:30

    Are you asking how to handle a 2,500,000,000 element matrix without terabytes of RAM?

    The way to handle 2 billion items without 8 billion bytes of RAM is by not keeping the matrix in memory.

    That means much more sophisticated algorithms to fetch it from the file system in pieces.
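
    For scale, 2,500,000,000 elements stored as float64 would take 2.5e9 x 8 bytes = 20 GB. One way to work on a matrix in pieces is np.memmap, which maps a file on disk and loads only the slices you touch. The sketch below uses a small stand-in matrix, a temporary file, and an arbitrary block size, all of which are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Small stand-in for a matrix that would not fit in RAM.
rows, cols = 2_000, 500
block = 256  # rows processed per step; tune to available memory
path = os.path.join(tempfile.mkdtemp(), "matrix.dat")

# Create the on-disk matrix and fill it block by block (row i holds the value i).
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(rows, cols))
for start in range(0, rows, block):
    stop = min(start + block, rows)
    mm[start:stop] = np.arange(start, stop, dtype=np.float32)[:, None]
mm.flush()
del mm

# Reopen read-only and compute column sums without loading the whole file.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, cols))
col_sums = np.zeros(cols, dtype=np.float64)
for start in range(0, rows, block):
    col_sums += mm[start:start + block].sum(axis=0, dtype=np.float64)

print(col_sums[:3])
```

    Row-wise blocks suit the default C (row-major) layout, since each block is then a contiguous range of the file.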
