How to input large data into python pandas using looping or parallel computing?

谎友^ 2021-02-13 21:46

I have an 8 GB CSV file and I am not able to run the code below, as it raises a memory error.

import pandas as pd

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)


        
5 Answers
  • 2021-02-13 22:25

    pandas read_csv has two arguments you can combine to read the file piece by piece (see the sketch below):

    nrows : the number of rows to read
    skiprows : the number of rows to skip at the start of the file (or a list of row indices to skip)
    

    Refer to documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
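
    A minimal sketch of that idea, assuming the same file and separator as in the question (the piece size of 1,000,000 rows is just an example value):

    import pandas as pd

    file = "./data.csv"
    piece_size = 1_000_000                      # example value; tune to your memory

    # Read the header once so later pieces can reuse the column names.
    cols = pd.read_csv(file, sep="/", nrows=0).columns

    # Read only data rows 1,000,001..2,000,000
    # (skiprows also counts the header line).
    part = pd.read_csv(file, sep="/", dtype=str, names=cols,
                       skiprows=1 + piece_size, nrows=piece_size)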

  • 2021-02-13 22:28
    Split the frame into pieces and process them in parallel with multiprocessing:

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool

    def processor(df):
        # Some work on one piece of the frame
        df.sort_values('id', inplace=True)   # 'id' is a placeholder column name
        return df

    if __name__ == '__main__':
        # The frame must fit in memory for np.array_split; load it first.
        df = pd.read_csv('./data.csv', sep='/', header=0, dtype=str)

        size = 8                             # number of pieces
        df_split = np.array_split(df, size)

        cores = 8
        pool = Pool(cores)
        for n, frame in enumerate(pool.imap(processor, df_split), start=1):
            frame.to_csv('{}'.format(n))
        pool.close()
        pool.join()
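
    This assumes the whole frame fits in memory for np.array_split. If it does not (as in the question), a variation (my assumption, not part of the original answer) is to feed the pool straight from read_csv with chunksize, so only a few chunks are held in memory at a time ('id' is the same placeholder column as above):

    import pandas as pd
    from multiprocessing import Pool

    def processor(chunk):
        # Some work on one chunk ('id' is a placeholder column name)
        chunk.sort_values('id', inplace=True)
        return chunk

    if __name__ == '__main__':
        # The reader yields DataFrames lazily, one chunk at a time.
        chunks = pd.read_csv("./data.csv", sep="/", header=0, dtype=str,
                             chunksize=100_000)
        with Pool(8) as pool:
            for n, frame in enumerate(pool.imap(processor, chunks), start=1):
                frame.to_csv('part_{}.csv'.format(n))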
    
  • 2021-02-13 22:33

    If you don't need all of the columns, you may also use the usecols parameter (see the sketch below):

    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

    usecols : array-like or callable, default None
    
    Return a subset of the columns. [...] 
    Using this parameter results in much faster parsing time and lower memory usage.
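
    A minimal sketch, assuming only two columns are actually needed ('id' and 'value' are made-up column names):

    import pandas as pd

    # Only the listed columns are parsed and kept in memory.
    df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str,
                     usecols=["id", "value"])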
    
  • 2021-02-13 22:47

    You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the CSV file is split into multiple pandas dataframes, each read in only when needed. However, not every pandas operation is available in dask. A minimal sketch follows.
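
    A minimal sketch, assuming dask is installed (the block size and the aggregation are just example choices):

    import dask.dataframe as dd

    # The file is read lazily in partitions; nothing is loaded until .compute().
    ddf = dd.read_csv("./data.csv", sep="/", dtype=str, blocksize="64MB")
    counts = ddf[ddf.columns[0]].value_counts().compute()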

  • 2021-02-13 22:49

    Use the chunksize parameter to read one chunk at a time and save the chunks to disk. This will split the original file into parts of 100,000 rows each:

    import pandas as pd

    file = "./data.csv"
    chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=100000)

    for it, chunk in enumerate(chunks):
        chunk.to_csv('chunk_{}.csv'.format(it), sep="/")
    

    If you know the number of rows in the original file, you can compute the exact chunksize needed to split it into 8 equal parts (nrows/8), as in the sketch below.
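
    A minimal sketch of that calculation, counting rows without loading the data (the path and separator follow the answer):

    import math
    import pandas as pd

    file = "./data.csv"

    # Count data rows cheaply (minus the header line).
    with open(file) as f:
        nrows = sum(1 for _ in f) - 1

    # A chunksize that yields roughly 8 equal parts.
    chunksize = math.ceil(nrows / 8)
    chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)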
