Pandas and Multiprocessing Memory Management: Splitting a DataFrame into Multiple Chunks

难免孤独 2021-02-06 06:24

I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds).

2 Answers
  •  梦毁少年i
    2021-02-06 06:55

    A better implementation is to use pandas' own chunked reader as a generator and feed it into the "pool.imap" function: pd.read_csv('.csv', chunksize=) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

    Benefit: It doesn't read the whole df into your main process (saves memory). Each child process is handed only the chunk it needs, which solves the child memory issue.

    Overhead: It requires you to save your df as CSV first and read it back in using pd.read_csv, which costs I/O time.

    Note: chunksize is not available to pd.read_pickle or other loading methods that are compressed on storage.
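To make the chunked reader concrete, here is a minimal sketch of the CSV round trip described above. The helper name `chunk_lengths`, the file name `frame.csv`, and the frame size are assumptions chosen for illustration; they are not part of the original answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def chunk_lengths(csv_path, chunksize):
    """Return the row count of each chunk yielded by the chunked reader.

    Hypothetical helper, used only to show what the reader produces.
    """
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of one DataFrame, so only one chunk is in memory at a time.
    reader = pd.read_csv(csv_path, chunksize=chunksize)
    return [len(chunk) for chunk in reader]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frame.csv")  # hypothetical file name
        # The I/O overhead: the frame has to be written to CSV once.
        pd.DataFrame(np.random.rand(250, 4)).to_csv(path, index=False)
        print(chunk_lengths(path, chunksize=100))
```

With 250 rows and a chunksize of 100, the last chunk is simply shorter than the others.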

    import multiprocessing as mp
    import pandas as pd

    def main():
        # Job parameters
        n_jobs = 4  # Pool size
        chunksize = 100  # Maximum number of rows per chunk

        # Preparation: the DataFrame is assumed to be saved as CSV already,
        # so it is never built in full inside the main process
        pool = mp.Pool(n_jobs)

        print('Starting MP')

        # Execute the wait and print function in parallel.
        # read_csv with chunksize returns an iterator of DataFrame chunks.
        df_chunked = pd.read_csv('.csv', chunksize=chunksize)  # modified
        # just_wait_and_print_len_and_idx is the worker defined in the question
        pool.imap(just_wait_and_print_len_and_idx, df_chunked)  # modified

        pool.close()
        pool.join()

        print('DONE')
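For a version that runs end to end, the following is a rough, self-contained sketch of the same idea. It is not the answer author's code: `process_chunk` is a hypothetical stand-in for the lengthy per-row work, and `run_parallel`, the temporary file name, and the frame size are likewise assumptions for illustration.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np
import pandas as pd


def process_chunk(chunk):
    """Hypothetical worker: report the first index label and row count."""
    return int(chunk.index[0]), len(chunk)


def run_parallel(csv_path, chunksize=100, n_jobs=4):
    """Feed CSV chunks to a worker pool and collect results in input order."""
    # The reader yields DataFrames of at most `chunksize` rows each.
    reader = pd.read_csv(csv_path, chunksize=chunksize)
    with mp.Pool(n_jobs) as pool:
        # imap returns results in the order the chunks were submitted;
        # list() consumes them all before the pool is torn down.
        return list(pool.imap(process_chunk, reader))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frame.csv")  # hypothetical file name
        pd.DataFrame(np.random.rand(1000, 10)).to_csv(path, index=False)
        results = run_parallel(path, chunksize=100, n_jobs=2)
        print(len(results), "chunks processed")
```

Each chunk is pickled and sent to a worker, so the children only ever hold the chunk they are working on. Returning small results (here a tuple) rather than whole DataFrames keeps the traffic back to the main process light.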
    
