Read a file as SciPy sparse matrix directly

跟風遠走 submitted on 2020-08-11 11:06:18

Question


Is it possible to read a space-separated file, in which each line contains floating-point numbers, directly as a SciPy sparse matrix?


Answer 1:


Given: A space-separated file containing ~56 million rows, each with 25 space-separated floating-point numbers, many of them zero.

Output: The file converted into a SciPy CSR sparse matrix as fast as possible.

There may be better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).

There may also be a better solution using HDF5, but this one uses a pandas DataFrame. It finishes in about 6.7 minutes and takes around 50 GB of RAM on a 32-core machine for 56,651,070 rows of 25 space-separated floating-point numbers each.

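import time

import numpy as np
import pandas as pd
import scipy.sparse as sps
import swifter  # registers the .swifter accessor on pandas objects

start_time = time.time()
input_file_name = "df"
sep = " "
# The file contains no commas, so each line is read as one string in a single column.
# The column is accessed as 'array_column', so the file is assumed to start with that header line.
df = pd.read_csv(input_file_name)
# Parse each line into a float array (swifter may parallelize this with Dask).
# Series.apply takes no axis argument, so none is passed here.
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(lambda x: np.fromstring(x, sep=sep))
# Stack the per-row arrays into one dense 2-D array, then convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))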

Output:

--- 406.22810888290405 seconds ---
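The code above materializes the whole file as one dense array before converting it. As a possible lower-memory variant (not what the answer's author ran, and not timed on the 56-million-row file), the file could be read in chunks and each chunk converted to CSR before the next is loaded. The sketch below assumes a headerless file with exactly one space between values and no trailing spaces; `read_sparse_chunked` and its `chunksize` default are hypothetical names and values chosen for illustration.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sps


def read_sparse_chunked(path, chunksize=1_000_000):
    """Hypothetical helper: build a CSR matrix from a headerless,
    space-separated file of floats, one chunk of rows at a time."""
    reader = pd.read_csv(
        path,
        sep=" ",
        header=None,          # assumption: the file has no header line
        dtype=np.float64,
        chunksize=chunksize,  # number of rows parsed per chunk
    )
    # Convert each dense chunk to CSR immediately, so only one dense
    # chunk is held in memory at any time.
    blocks = [sps.csr_matrix(chunk.to_numpy()) for chunk in reader]
    # Stack the per-chunk matrices row-wise into one CSR matrix.
    return sps.vstack(blocks, format="csr")
```

A call such as `read_sparse_chunked("df")` would return the same kind of object as `df_np_sp_matrix`. A larger `chunksize` means fewer, bigger conversions at the cost of higher peak memory.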

Matrix size:

df_np_sp_matrix

Output:

<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
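To put the stored-element count in context, the figures in this output can be turned into a density and a rough memory estimate. This is only back-of-the-envelope arithmetic: it assumes 8-byte float64 values and 4-byte index arrays, and it says nothing about the intermediate memory used while loading.

```python
# Figures taken from the matrix repr above.
n_rows, n_cols = 56_651_070, 25
nnz = 508_880_850

# Fraction of entries that are explicitly stored.
density = nnz / (n_rows * n_cols)

# Assumptions: float64 values (8 bytes) and 32-bit indices (4 bytes).
dense_bytes = n_rows * n_cols * 8
# CSR stores three arrays: data (nnz), indices (nnz), indptr (n_rows + 1).
csr_bytes = nnz * 8 + nnz * 4 + (n_rows + 1) * 4

print(f"density:      {density:.1%}")
print(f"dense array:  {dense_bytes / 1e9:.1f} GB")
print(f"CSR estimate: {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions, about 36% of the entries are stored, and the CSR form takes roughly 6.3 GB against roughly 11.3 GB for the equivalent dense array.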


Source: https://stackoverflow.com/questions/63292290/read-a-file-as-scipy-sparse-matrix-directly
