Here is my question.
I have a bunch of .csv files (or other files). Pandas is an easy way to read them and save them in DataFrame format. But when the number of files grows large, reading them one by one becomes slow, so I would like to read them in parallel. How can I do that?
Using Pool:
import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)

def main():
    # get a list of file names
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # set up your pool
    with Pool(processes=8) as pool:  # or whatever your hardware can support
        # have your pool map the file names to dataframes
        df_list = pool.map(read_csv, file_list)

    # reduce the list of dataframes to a single dataframe
    combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()
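If you prefer the standard-library concurrent.futures interface, the same pattern can be written with ProcessPoolExecutor; this is just a sketch of the equivalent approach, not part of the answer above:

# Same idea with concurrent.futures: executor.map returns the dataframes
# in input order, just like Pool.map.
import os
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def read_csv(filename):
    return pd.read_csv(filename)

def main():
    file_list = [f for f in os.listdir('.') if f.endswith('.csv')]
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        df_list = list(executor.map(read_csv, file_list))
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

if __name__ == '__main__':
    main()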
I couldn't get map/map_async to work, but I did get it working with apply_async.
Two possible ways (I have no idea which one is better): A) collect the results in a list and concatenate at the end, or B) concatenate inside the callback as the results arrive.
I find glob easy to use for listing and filtering files in a directory:
from glob import glob
import pandas as pd
from multiprocessing import Pool

folder = "./task_1/"  # note the "/" at the end
file_list = glob(folder + '*.csv')

def my_read(filename):
    f = pd.read_csv(filename)
    # reshape the VALUE column into a 75x90 frame (specific to my data)
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

#DF_LIST = []         # A) concat at the end
DF = pd.DataFrame()   # B) concat during

def DF_LIST_append(result):
    #DF_LIST.append(result)                           # A) concat at the end
    global DF                                         # B) concat during
    DF = pd.concat([DF, result], ignore_index=True)   # B) concat during

if __name__ == '__main__':
    pool = Pool(processes=8)
    for file in file_list:
        pool.apply_async(my_read, args=(file,), callback=DF_LIST_append)
    pool.close()
    pool.join()
    #DF = pd.concat(DF_LIST, ignore_index=True)  # A) concat at the end
    print(DF.shape)
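For reference, here is variant A pulled together from the commented lines above (same my_read and file_list). The callback runs in the parent process, so appending to a plain list is safe:

# Variant A: collect the results in a list, concatenate once at the end.
DF_LIST = []

pool = Pool(processes=8)
for file in file_list:
    pool.apply_async(my_read, args=(file,), callback=DF_LIST.append)
pool.close()
pool.join()

DF = pd.concat(DF_LIST, ignore_index=True)
print(DF.shape)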
If you aren't against using another library, you could use GraphLab's SFrame. It creates an object similar to a DataFrame and reads data very quickly, which helps if performance is a big issue.
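A minimal sketch, assuming GraphLab Create is installed; the path is just an example, and the exact call names may differ in other versions of the library:

# SFrame.read_csv is GraphLab Create's fast CSV reader;
# to_dataframe() converts the result back to a pandas DataFrame.
import graphlab as gl

sf = gl.SFrame.read_csv('./task_1/my_file.csv')  # example path
df = sf.to_dataframe()                           # pandas DataFrame, if you need one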
The dask library is designed to address exactly this kind of problem, among others.
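A minimal sketch with dask.dataframe: read_csv accepts a glob pattern, and compute() returns an ordinary pandas DataFrame. The folder path is just an example:

# dask builds one partitioned dataframe lazily, then reads the files
# in parallel when compute() is called.
import dask.dataframe as dd

ddf = dd.read_csv('./task_1/*.csv')  # example path; accepts a glob pattern
combined_df = ddf.compute()          # materialize as a pandas DataFrame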