I'm converting a large text file to an HDF store in hopes of faster data access. The conversion works all right; however, reading from the CSV file is not done in parallel. It
Piggybacking off of @MRocklin's answer, in newer versions of dask you can use `df.compute(scheduler='processes')` or `df.compute(scheduler='threads')` to convert to pandas using multiprocessing or multithreading:
from dask import dataframe as ddf

# Read the CSVs lazily; blocksize is in bytes, so each ~1 MB chunk
# becomes one partition that can be parsed in parallel
df = ddf.read_csv("data/Measurements*.csv",
                  sep=';',
                  parse_dates=["DATETIME"],
                  blocksize=1000000,
                  )
df = df.compute(scheduler='processes')  # convert to pandas in parallel
df['Type'] = df['Type'].astype('category')
df['Condition'] = df['Condition'].astype('category')
df.to_hdf('data/data.hdf', 'Measurements', format='table', mode='w')
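
If the combined data is too large to hold in memory as a single pandas DataFrame, dask can also write the HDF file itself, partition by partition. This is a minimal sketch assuming the same file layout and column names as above; note that dask serializes writes to a single HDF5 file behind a lock, so the parallelism here mainly speeds up the CSV parsing and the category computation:

from dask import dataframe as ddf

df = ddf.read_csv("data/Measurements*.csv",
                  sep=';',
                  parse_dates=["DATETIME"],
                  blocksize=1000000)

# Scan all partitions once to build one consistent set of category
# codes, so every partition shares the same categorical dtype
df = df.categorize(columns=['Type', 'Condition'])

# Dask appends the partitions one at a time (format='table' allows
# appending), so the whole dataset never has to fit in memory at once
df.to_hdf('data/data.hdf', 'Measurements', format='table', mode='w')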