Can dask parallelize reading from a csv file?

暗喜 2021-01-30 23:20

I'm converting a large text file to an HDF store in hopes of faster data access. The conversion works all right; however, reading from the csv file is not done in parallel. It …

2 Answers
  •  攒了一身酷 2021-01-31 00:10

    Piggybacking off of @MRocklin's answer, in newer versions of dask, you can use df.compute(scheduler='processes') or df.compute(scheduler='threads') to convert to pandas using multiprocessing or multithreading:

    from dask import dataframe as ddf

    # blocksize splits the csv into ~1 MB chunks, so dask creates one
    # partition per chunk and can parse the chunks in parallel
    df = ddf.read_csv("data/Measurements*.csv",
                      sep=';',
                      parse_dates=["DATETIME"],
                      blocksize=1000000,
                      )

    # materialize as a single pandas DataFrame, parsing the partitions with
    # the multiprocessing scheduler (use scheduler='threads' for threads instead)
    df = df.compute(scheduler='processes')

    df['Type'] = df['Type'].astype('category')
    df['Condition'] = df['Condition'].astype('category')

    df.to_hdf('data/data.hdf', 'Measurements', format='table', mode='w')
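
    If the goal is only to get the data into HDF, a lazier variant (not part of the answer's original code, so treat it as a sketch) avoids building one big pandas DataFrame at all: dask's own to_hdf computes and writes the partitions as they are parsed. The file pattern and column names below are taken from the question; the astype('category') casts are left out here, since they are simplest to do on the pandas side.

    from dask import dataframe as ddf

    df = ddf.read_csv("data/Measurements*.csv",
                      sep=';',
                      parse_dates=["DATETIME"],
                      blocksize=1000000,
                      )

    print(df.npartitions)  # one partition per ~1 MB block; these are read in parallel

    # '*' in the path is replaced by the partition number, so each block is
    # written to its own HDF file as soon as it has been parsed
    df.to_hdf('data/data-*.hdf', 'Measurements', mode='w')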
    
