Question
Is it possible to read a space-separated file, in which each line contains floating-point numbers, directly as a SciPy sparse matrix?
Answer 1:
Given: A space-separated file containing ~56 million rows, with 25 space-separated floating-point numbers in each row and a lot of sparsity.
Output: Convert the file into a SciPy CSR sparse matrix as fast as possible.
There may be better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).
There may also be a better solution using HDF5, but this one uses a Pandas DataFrame. For 56,651,070 rows of 25 space-separated floating-point numbers each, with a lot of sparsity, it finishes in about 6.7 minutes and takes around 50 GB of RAM on a 32-core machine.
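import time

import numpy as np
import pandas as pd
import scipy.sparse as sps
import swifter  # registers the .swifter accessor on pandas objects

start_time = time.time()
input_file_name = "df"
sep = " "

# The code relies on the file's header naming a single column 'array_column',
# so that each line is read as one string of space-separated numbers.
df = pd.read_csv(input_file_name)
# Parse each string into a float array, parallelised by swifter.
# No axis argument: it is not needed when applying to a Series.
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(lambda x: np.fromstring(x, sep=sep))
# Stack the per-row arrays into one dense 2-D array, then convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))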
Output:
--- 406.22810888290405 seconds ---
Matrix size:
df_np_sp_matrix
Output:
<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
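The code above builds the full dense array with np.stack before converting it, which is likely a large part of the memory footprint. As a rough sketch of an alternative (not the approach timed above, and not benchmarked here), the file could be read in chunks, with each chunk converted to CSR and the blocks stacked at the end. This assumes a headerless file with single spaces between values and no trailing space; the function name `read_sparse_in_chunks`, the `chunksize` value, and the sample data are made up for illustration.

```python
import io

import numpy as np
import pandas as pd
import scipy.sparse as sps


def read_sparse_in_chunks(source, chunksize=1_000_000):
    """Read a headerless, space-separated file of floats into one CSR matrix.

    `source` is a path or file-like object; `chunksize` is rows per chunk.
    Both names are illustrative and not part of the original answer.
    """
    blocks = []
    reader = pd.read_csv(source, sep=" ", header=None,
                         dtype=np.float64, chunksize=chunksize)
    for chunk in reader:
        # Only one dense chunk is held in memory at a time.
        blocks.append(sps.csr_matrix(chunk.to_numpy()))
    # Stack the per-chunk sparse blocks row-wise into a single CSR matrix.
    return sps.vstack(blocks, format="csr")


# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("0 0 1.5\n0 2.0 0\n0 0 0\n3.25 0 0\n")
matrix = read_sparse_in_chunks(sample, chunksize=2)
print(matrix.shape, matrix.nnz)
```

Whether this is faster than the swifter version would need to be measured on the real data; the likely benefit is that peak memory is bounded by the chunk size plus the sparse blocks, rather than by the full dense array.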
Source: https://stackoverflow.com/questions/63292290/read-a-file-as-scipy-sparse-matrix-directly