Storing scipy sparse matrix as HDF5

深忆病人 2021-02-05 21:09

I want to compress and store a very large SciPy sparse matrix in HDF5 format. How do I do this? I've tried the code below:

a = csr_matrix((dat, (row, col)), shape=(9479         


        
2 Answers
  • 2021-02-05 21:45

    You can use the scipy.sparse.save_npz function. Note that it writes a (optionally compressed) .npz archive rather than an HDF5 file.
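
    As a minimal sketch of that route, the round trip below saves a small random matrix with save_npz and reads it back with load_npz. The matrix size, the random seed, and the temporary file path are assumptions made up for illustration; they are not from the question.

```python
import os
import tempfile

from scipy import sparse

# Small random CSR matrix standing in for the real (much larger) data.
M = sparse.random(100, 50, density=0.05, format='csr', random_state=0)

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'matrix.npz')

# compressed=True (the default) writes a zip-compressed archive.
sparse.save_npz(path, M, compressed=True)

# load_npz rebuilds the matrix in the format it was saved in.
M_loaded = sparse.load_npz(path)

# The element-wise "!=" comparison has no stored entries when both agree.
same = (M != M_loaded).nnz == 0
print(same)
```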

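    Alternatively, consider using pandas.SparseDataFrame, but be aware that this approach is very slow (thanks to @hpaulj for testing and pointing it out). Note also that SparseDataFrame and SparseSeries were removed in pandas 1.0, so the demo below only runs on older pandas versions.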

    Demo:

    generating sparse matrix and SparseDataFrame

    In [54]: import numpy as np

    In [55]: import pandas as pd
    
    In [56]: from scipy.sparse import *
    
    In [57]: m = csr_matrix((20, 10), dtype=np.int8)
    
    In [58]: m
    Out[58]:
    <20x10 sparse matrix of type '<class 'numpy.int8'>'
            with 0 stored elements in Compressed Sparse Row format>
    
    In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
        ...:                           for i in np.arange(m.shape[0])])
        ...:
    
    In [61]: type(sdf)
    Out[61]: pandas.sparse.frame.SparseDataFrame
    
    In [62]: sdf.info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    RangeIndex: 20 entries, 0 to 19
    Data columns (total 10 columns):
    0    20 non-null int8
    1    20 non-null int8
    2    20 non-null int8
    3    20 non-null int8
    4    20 non-null int8
    5    20 non-null int8
    6    20 non-null int8
    7    20 non-null int8
    8    20 non-null int8
    9    20 non-null int8
    dtypes: int8(10)
    memory usage: 280.0 bytes
    

    saving SparseDataFrame to HDF file

    In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')
    

    reading from HDF file

    In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')
    
    In [66]: store
    Out[66]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: d:/temp/sparse_df.h5
    /sparse_df            sparse_frame
    
    In [67]: x = store['sparse_df']
    
    In [68]: type(x)
    Out[68]: pandas.sparse.frame.SparseDataFrame
    
    In [69]: x.info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 10 columns):
    0    20 non-null int8
    1    20 non-null int8
    2    20 non-null int8
    3    20 non-null int8
    4    20 non-null int8
    5    20 non-null int8
    6    20 non-null int8
    7    20 non-null int8
    8    20 non-null int8
    9    20 non-null int8
    dtypes: int8(10)
    memory usage: 360.0 bytes
    
  • 2021-02-05 21:49

    A csr matrix stores its values in 3 arrays (data, indices, and indptr). It is not an array or array subclass, so h5py cannot save it directly. The best you can do is save those attributes and recreate the matrix on loading:

    In [248]: M = sparse.random(5,10,.1, 'csr')
    In [249]: M
    Out[249]: 
    <5x10 sparse matrix of type '<class 'numpy.float64'>'
        with 5 stored elements in Compressed Sparse Row format>
    In [250]: M.data
    Out[250]: array([ 0.91615298,  0.49907752,  0.09197862,  0.90442401,  0.93772772])
    In [251]: M.indptr
    Out[251]: array([0, 0, 1, 2, 3, 5], dtype=int32)
    In [252]: M.indices
    Out[252]: array([5, 7, 5, 2, 6], dtype=int32)
    In [253]: M.data
    Out[253]: array([ 0.91615298,  0.49907752,  0.09197862,  0.90442401,  0.93772772])
    

    The coo format has data, row, and col attributes, essentially the same (dat, (row, col)) arrays you use to create your matrix a.

    In [254]: M.tocoo().row
    Out[254]: array([1, 2, 3, 4, 4], dtype=int32)
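
    To make the correspondence concrete, here is a small sketch that builds a matrix from triplets, converts it to coo, and rebuilds it from the coo attributes. The dat, row, and col values are made up for illustration and are not the question's data.

```python
import numpy as np
from scipy import sparse

# Illustrative triplets in the same layout as the question's (dat, (row, col)).
dat = np.array([1.0, 2.0, 3.0, 4.0])
row = np.array([0, 1, 1, 3])
col = np.array([2, 0, 3, 1])

a = sparse.csr_matrix((dat, (row, col)), shape=(4, 4))

# Converting to COO exposes the triplets again as plain NumPy arrays,
# which are exactly the pieces that can be written to a file.
coo = a.tocoo()
rebuilt = sparse.csr_matrix((coo.data, (coo.row, coo.col)), shape=coo.shape)

print(np.array_equal(a.toarray(), rebuilt.toarray()))
```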
    

    The new save_npz function does:

    arrays_dict = dict(format=matrix.format, shape=matrix.shape, data=matrix.data)
    if matrix.format in ('csc', 'csr', 'bsr'):
        arrays_dict.update(indices=matrix.indices, indptr=matrix.indptr)
    ...
    elif matrix.format == 'coo':
        arrays_dict.update(row=matrix.row, col=matrix.col)
    ...
    np.savez(file, **arrays_dict)
    

    In other words it collects the relevant attributes in a dictionary and uses savez to create the zip archive.
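
    The following sketch mimics that idea by hand for the csr case only: it collects the attributes in a dict, writes them with NumPy, and rebuilds the matrix after loading. It is not the actual save_npz implementation; the in-memory buffer and the use of np.savez_compressed are choices made here for illustration.

```python
import io

import numpy as np
from scipy import sparse

M = sparse.random(5, 10, density=0.1, format='csr', random_state=0)

# Collect the defining arrays in a dict, as the excerpt does for CSR.
arrays_dict = dict(format=M.format, shape=M.shape, data=M.data,
                   indices=M.indices, indptr=M.indptr)

# Write to an in-memory buffer instead of a file on disk (an assumption
# made to keep the sketch self-contained).
buf = io.BytesIO()
np.savez_compressed(buf, **arrays_dict)
buf.seek(0)

# Reload the arrays and rebuild the matrix from (data, indices, indptr).
with np.load(buf) as loaded:
    fmt = loaded['format'].item()
    shape = tuple(int(n) for n in loaded['shape'])
    M2 = sparse.csr_matrix(
        (loaded['data'], loaded['indices'], loaded['indptr']), shape=shape)

print(fmt, np.array_equal(M.toarray(), M2.toarray()))
```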

    The same sort of method could be used with an h5py file. There is more on save_npz in a recent SO question, with links to the source code:

    save_npz method missing from scipy.sparse

    See if you can get this working. If you can create a csr matrix, you can recreate it from its attributes (or the coo equivalents). I can make a working example if needed.

    csr to h5py example

    import numpy as np
    import h5py
    from scipy import sparse
    
    M = sparse.random(10, 10, density=.2, format='csr')
    print(repr(M))
    
    print(M.data)
    print(M.indices)
    print(M.indptr)
    
    f = h5py.File('sparse.h5','w')
    g = f.create_group('Mcsr')
    g.create_dataset('data',data=M.data)
    g.create_dataset('indptr',data=M.indptr)
    g.create_dataset('indices',data=M.indices)
    g.attrs['shape'] = M.shape
    f.close()
    
    f = h5py.File('sparse.h5','r')
    print(list(f.keys()))
    print(list(f['Mcsr'].keys()))
    
    g2 = f['Mcsr']
    print(g2.attrs['shape'])
    
    M1 = sparse.csr_matrix((g2['data'][:],g2['indices'][:],
        g2['indptr'][:]), g2.attrs['shape'])
    print(repr(M1))
    print(np.allclose(M1.toarray(), M.toarray()))  # .toarray() replaces the older .A shortcut
    f.close()
    

    producing

    1314:~/mypy$ python3 stack43390038.py 
    <10x10 sparse matrix of type '<class 'numpy.float64'>'
        with 20 stored elements in Compressed Sparse Row format>
    [ 0.13640389  0.92698959 ....  0.7762265 ]
    [4 5 0 3 0 2 0 2 5 6 7 1 7 9 1 3 4 6 8 9]
    [ 0  2  4  6  9 11 11 11 14 19 20]
    ['Mcsr']
    ['data', 'indices', 'indptr']
    [10 10]
    <10x10 sparse matrix of type '<class 'numpy.float64'>'
        with 20 stored elements in Compressed Sparse Row format>
    True
    

    coo alternative

    # assumes f is an h5py.File that is still open in a writable mode (e.g. 'a')
    Mo = M.tocoo()
    g = f.create_group('Mcoo')
    g.create_dataset('data', data=Mo.data)
    g.create_dataset('row', data=Mo.row)
    g.create_dataset('col', data=Mo.col)
    g.attrs['shape'] = Mo.shape
    
    g2 = f['Mcoo']
    M2 = sparse.coo_matrix((g2['data'], (g2['row'], g2['col'])),
       g2.attrs['shape'])   # don't need the [:]
    # could also use sparse.csr_matrix or M2.tocsr()
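
    Since the question also asks about compression, the csr example can be wrapped into a pair of helpers that pass compression='gzip' to create_dataset. This is only a sketch: the names save_sparse_h5 and load_sparse_h5, the group name, the matrix size, and the temporary path are all hypothetical and chosen here for illustration.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def save_sparse_h5(path, name, matrix):
    """Hypothetical helper: store a matrix as gzip-compressed CSR arrays."""
    matrix = matrix.tocsr()
    with h5py.File(path, 'a') as f:
        g = f.create_group(name)
        for key in ('data', 'indices', 'indptr'):
            g.create_dataset(key, data=getattr(matrix, key),
                             compression='gzip')
        g.attrs['shape'] = matrix.shape


def load_sparse_h5(path, name):
    """Hypothetical helper: rebuild the CSR matrix written by save_sparse_h5."""
    with h5py.File(path, 'r') as f:
        g = f[name]
        # The shape attribute comes back as a NumPy array; make it a tuple.
        shape = tuple(int(n) for n in g.attrs['shape'])
        return sparse.csr_matrix(
            (g['data'][:], g['indices'][:], g['indptr'][:]), shape=shape)


# Usage with a made-up matrix and a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'sparse_gzip.h5')
M = sparse.random(1000, 1000, density=0.01, format='csr', random_state=0)
save_sparse_h5(path, 'Mcsr', M)
M1 = load_sparse_h5(path, 'Mcsr')
print(np.allclose(M1.toarray(), M.toarray()))
```

    The context managers close the file even if an error occurs, and the [:] slices copy the datasets into memory before the file is closed.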
    