Storing scipy sparse matrix as HDF5

深忆病人 2021-02-05 21:09

I want to compress and store a humongous SciPy sparse matrix in HDF5 format. How do I do this? I've tried the code below:

a = csr_matrix((dat, (row, col)), shape=(9479         


        
2 answers
  •  有刺的猬
    2021-02-05 21:45

    You can use the scipy.sparse.save_npz function, which writes the matrix to a compressed NumPy .npz file rather than to HDF5.
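As a minimal sketch of the save_npz round trip: the matrix below is a small random stand-in for the one in the question, and the file name `matrix.npz` and the temporary directory are assumptions made only for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Small random stand-in for the (much larger) matrix in the question.
rng = np.random.default_rng(0)
dat = rng.random(5000)
row = rng.integers(0, 1000, size=5000)
col = rng.integers(0, 500, size=5000)
a = csr_matrix((dat, (row, col)), shape=(1000, 500))

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "matrix.npz")

# compressed=True is the default; it is spelled out here for clarity.
save_npz(path, a, compressed=True)

# load_npz restores the matrix in the sparse format it was saved in.
b = load_npz(path)
```

The loaded matrix can be compared with the original through `(a != b).nnz`, which counts the positions where the two differ.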

    Alternatively, consider using pandas.SparseDataFrame, but be aware that this method is very slow (thanks to @hpaulj for testing and pointing it out). SparseDataFrame was removed in pandas 1.0, so the demo below only runs on older pandas versions.

    Demo:

    Generating a sparse matrix and a SparseDataFrame:

    In [54]: import numpy as np
    
    In [55]: import pandas as pd
    
    In [56]: from scipy.sparse import csr_matrix
    
    In [57]: m = csr_matrix((20, 10), dtype=np.int8)
    
    In [58]: m
    Out[58]:
    <20x10 sparse matrix of type '<class 'numpy.int8'>'
            with 0 stored elements in Compressed Sparse Row format>
    
    In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
        ...:                           for i in np.arange(m.shape[0])])
        ...:
    
    In [61]: type(sdf)
    Out[61]: pandas.sparse.frame.SparseDataFrame
    
    In [62]: sdf.info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    RangeIndex: 20 entries, 0 to 19
    Data columns (total 10 columns):
    0    20 non-null int8
    1    20 non-null int8
    2    20 non-null int8
    3    20 non-null int8
    4    20 non-null int8
    5    20 non-null int8
    6    20 non-null int8
    7    20 non-null int8
    8    20 non-null int8
    9    20 non-null int8
    dtypes: int8(10)
    memory usage: 280.0 bytes
    

    Saving the SparseDataFrame to an HDF file:

    In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')
    

    Reading from the HDF file:

    In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')
    
    In [66]: store
    Out[66]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: d:/temp/sparse_df.h5
    /sparse_df            sparse_frame
    
    In [67]: x = store['sparse_df']
    
    In [68]: type(x)
    Out[68]: pandas.sparse.frame.SparseDataFrame
    
    In [69]: x.info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    Int64Index: 20 entries, 0 to 19
    Data columns (total 10 columns):
    0    20 non-null int8
    1    20 non-null int8
    2    20 non-null int8
    3    20 non-null int8
    4    20 non-null int8
    5    20 non-null int8
    6    20 non-null int8
    7    20 non-null int8
    8    20 non-null int8
    9    20 non-null int8
    dtypes: int8(10)
    memory usage: 360.0 bytes
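
Since the question asks for HDF5 specifically, another possible approach (not part of the answer above) is to store the three arrays that define a CSR matrix, plus its shape, as gzip-compressed datasets. This is only a sketch: it assumes the h5py package is installed, and the helper names `save_csr_hdf5` and `load_csr_hdf5`, the group name, and the file path are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy.sparse import csr_matrix


def save_csr_hdf5(path, name, m):
    """Write the CSR arrays and the shape of m into one HDF5 group."""
    m = csr_matrix(m)
    with h5py.File(path, "a") as f:
        g = f.create_group(name)
        # A CSR matrix is fully described by data, indices, indptr and its shape.
        # Note: gzip needs chunked storage, so this assumes m has stored entries.
        for key in ("data", "indices", "indptr"):
            g.create_dataset(key, data=getattr(m, key), compression="gzip")
        g.attrs["shape"] = m.shape


def load_csr_hdf5(path, name):
    """Rebuild the CSR matrix from the arrays written by save_csr_hdf5."""
    with h5py.File(path, "r") as f:
        g = f[name]
        shape = tuple(int(n) for n in g.attrs["shape"])
        return csr_matrix((g["data"][:], g["indices"][:], g["indptr"][:]), shape=shape)


# Small random stand-in for the matrix in the question.
rng = np.random.default_rng(0)
a = csr_matrix(
    (rng.random(5000), (rng.integers(0, 1000, 5000), rng.integers(0, 500, 5000))),
    shape=(1000, 500),
)

h5_path = os.path.join(tempfile.mkdtemp(), "sparse.h5")  # hypothetical path
save_csr_hdf5(h5_path, "a", a)
restored = load_csr_hdf5(h5_path, "a")
```

Keeping the arrays as separate datasets means the dense form of the matrix is never built, so disk usage depends on the number of stored entries rather than on the full shape.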
    
