I want to compress and store a very large SciPy sparse matrix in HDF5 format. How do I do this? I've tried the code below:
a = csr_matrix((dat, (row, col)), shape=(9479
You can use the scipy.sparse.save_npz function. Note that it writes a compressed .npz file rather than HDF5.
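A minimal round-trip sketch of the save_npz approach follows. The random matrix and the temporary output path are stand-ins chosen for illustration; they are not from the question.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small random CSR matrix standing in for the large one in the question.
m = sparse.random(1000, 500, density=0.01, format="csr",
                  dtype=np.float64, random_state=0)

# Hypothetical output path in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrix.npz")

# compressed=True is the default; it is spelled out here for clarity.
sparse.save_npz(path, m, compressed=True)

# load_npz restores the matrix in the sparse format it was saved in.
loaded = sparse.load_npz(path)

assert loaded.format == "csr"
assert loaded.shape == m.shape
assert (loaded != m).nnz == 0  # no differing entries
```

Only the stored entries and the index arrays are written, so the file size scales with the number of non-zeros rather than with the full shape.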
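Alternatively, consider using pandas.SparseDataFrame, but be aware that this approach is very slow (thanks to @hpaulj for testing and pointing this out). SparseDataFrame and SparseSeries were removed in pandas 1.0, so the demo below requires an older pandas version.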
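On pandas 1.0 and later, where SparseDataFrame no longer exists, a rough equivalent can be sketched with the DataFrame.sparse accessor. This is an assumption about what a modern version of this approach might look like, not part of the original answer, and it only covers the conversion to and from a DataFrame, not the HDF step. The two non-zero entries are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# 20x10 int8 matrix with two illustrative non-zero entries.
values = np.array([7, 5], dtype=np.int8)
rows = np.array([2, 15])
cols = np.array([3, 9])
m = csr_matrix((values, (rows, cols)), shape=(20, 10))

# Build a DataFrame whose columns are backed by sparse arrays.
df = pd.DataFrame.sparse.from_spmatrix(m)

# Convert back to a SciPy matrix through the same accessor.
back = df.sparse.to_coo().tocsr()

assert df.shape == m.shape
assert np.array_equal(back.toarray(), m.toarray())
```

from_spmatrix avoids the row-by-row loop used in the demo, which is the likely source of the slowness mentioned above.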
Demo:
Generating a sparse matrix and a SparseDataFrame:
In [54]: import numpy as np
In [55]: import pandas as pd
In [56]: from scipy.sparse import *
In [57]: m = csr_matrix((20, 10), dtype=np.int8)
In [58]: m
Out[58]:
<20x10 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
...: for i in np.arange(m.shape[0])])
...:
In [61]: type(sdf)
Out[61]: pandas.sparse.frame.SparseDataFrame
In [62]: sdf.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
0 20 non-null int8
1 20 non-null int8
2 20 non-null int8
3 20 non-null int8
4 20 non-null int8
5 20 non-null int8
6 20 non-null int8
7 20 non-null int8
8 20 non-null int8
9 20 non-null int8
dtypes: int8(10)
memory usage: 280.0 bytes
Saving the SparseDataFrame to an HDF file:
In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')
Reading from the HDF file:
In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')
In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/sparse_df.h5
/sparse_df sparse_frame
In [67]: x = store['sparse_df']
In [68]: type(x)
Out[68]: pandas.sparse.frame.SparseDataFrame
In [69]: x.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 10 columns):
0 20 non-null int8
1 20 non-null int8
2 20 non-null int8
3 20 non-null int8
4 20 non-null int8
5 20 non-null int8
6 20 non-null int8
7 20 non-null int8
8 20 non-null int8
9 20 non-null int8
dtypes: int8(10)
memory usage: 360.0 bytes
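If the file really has to be HDF5, another option is to store the three arrays that define a CSR matrix (data, indices, indptr) plus its shape as compressed datasets. The sketch below assumes the h5py package is installed; the helper names save_csr_hdf5 and load_csr_hdf5, the group name and the file path are hypothetical, and this is not a method from the original answer.

```python
import os
import tempfile

import h5py
from scipy import sparse


def save_csr_hdf5(path, m, group="matrix"):
    """Write the CSR component arrays and the shape into one HDF5 group."""
    m = sparse.csr_matrix(m)  # convert other sparse formats to CSR
    with h5py.File(path, "w") as f:
        g = f.create_group(group)
        for name in ("data", "indices", "indptr"):
            # gzip-compress each component array
            g.create_dataset(name, data=getattr(m, name), compression="gzip")
        g.attrs["shape"] = m.shape


def load_csr_hdf5(path, group="matrix"):
    """Rebuild the CSR matrix from the arrays written by save_csr_hdf5."""
    with h5py.File(path, "r") as f:
        g = f[group]
        shape = tuple(int(n) for n in g.attrs["shape"])
        return sparse.csr_matrix(
            (g["data"][:], g["indices"][:], g["indptr"][:]), shape=shape
        )


# Usage with a small random matrix and a temporary file.
m = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
path = os.path.join(tempfile.mkdtemp(), "matrix.h5")
save_csr_hdf5(path, m)
loaded = load_csr_hdf5(path)

assert loaded.shape == m.shape
assert (loaded != m).nnz == 0  # no differing entries
```

Keeping the matrix as three flat arrays avoids any dense intermediate, so memory use stays proportional to the number of stored elements.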