Question
I have two large Pandas DataFrames (1GB+) whose data needs to be processed by multiple workers. In a toy example with much smaller DataFrames (DFs) I can perform the operations without issue.
Below is my reproducible example.
I've tried several routes:
- Using chunksize: I am unable to take advantage of it. The DFs need to be sliced into specific pieces on an index before each piece is fed to the workers, and chunksize can only split them at arbitrary lengths.
- Using starmap: This is what you see in the code below. I pre-slice the DFs on the indexes and store the pieces in an iterable, so each piece can be passed as a small frame (or dict) to a worker process. This is not feasible at the size of my real DFs: the iterable never finishes being created. I've tried and failed to use a generator/yield with starmap; I would appreciate an example of a workaround if that is an option (a sketch of the idea appears after this list).
- Using imap: The entire input DFs end up going to each worker process. I was able to use a generator/yield through an intermediate function that slices the DFs and hands each piece to a worker without building a huge iterable, but it ended up slower than a plain for loop: the overhead of passing the data to the workers was the bottleneck.
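For reference, a minimal sketch of the generator-fed imap idea (the names worker and slice_pairs are made up for illustration; the 'Number' column matches my example below). The generator means the full iterable is never materialized, although each slice still gets pickled on its way to a worker:

import multiprocessing

import numpy as np
import pandas as pd

def worker(pair):
    # imap passes a single argument, so the two slices travel as one tuple.
    a, b = pair
    return len(a), len(b)

def slice_pairs(df1, df2):
    # groupby slices each frame once, instead of one boolean mask per value.
    groups2 = dict(tuple(df2.groupby('Number')))
    for value, a in df1.groupby('Number'):
        yield a, groups2.get(value, df2.iloc[0:0])

if __name__ == '__main__':
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        for res in pool.imap(worker, slice_pairs(df1, df2), chunksize=16):
            pass
    print('Done')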
I am ready to conclude that multiprocessing cannot be applied to a large data table that needs to be sliced before being sent to workers.
import random
import numpy as np
import pandas as pd
import multiprocessing

def df_slicer(df1, df2):
    while len(df1) > 0:
        # Will use first found number as the index for this process.
        value_to_slice_df_by = df1.iloc[0]['Number']
        a = df1[df1['Number'] == value_to_slice_df_by].copy()
        b = df2[df2['Number'] == value_to_slice_df_by].copy()
        print('len(df1): {}, len(df2): {}'.format(len(a), len(b)))
        return [a, b]

if __name__ == '__main__':
    # These are large and will be pulled in from Pandas pickles.
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])

    iterable = [[df1.loc[df1['Number'] == i], df2.loc[df2['Number'] == i]] for i in list(np.unique(df1['Number']))]

    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    for res in pool.starmap(df_slicer, [[i[0], i[1]] for i in iterable]):
        result = res
    print('Done')
    pool.close()
    pool.join()
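A sketch of the opposite strategy, which would sidestep the transfer overhead entirely: each worker loads the pickles itself through Pool's initializer, and only the slice keys cross the process boundary. The paths df1.pkl / df2.pkl and the helper names here are hypothetical, not part of my setup:

import multiprocessing

import pandas as pd

_df1 = None
_df2 = None

def init_worker(df1_path, df2_path):
    # Runs once in every worker process; each worker keeps its own copy.
    global _df1, _df2
    _df1 = pd.read_pickle(df1_path)
    _df2 = pd.read_pickle(df2_path)

def process_value(value):
    # Slice locally inside the worker; only `value` was pickled by the pool.
    a = _df1[_df1['Number'] == value]
    b = _df2[_df2['Number'] == value]
    return len(a), len(b)

if __name__ == '__main__':
    df1_path, df2_path = 'df1.pkl', 'df2.pkl'  # hypothetical pickle paths
    values = pd.read_pickle(df1_path)['Number'].unique()
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1,
                              initializer=init_worker,
                              initargs=(df1_path, df2_path)) as pool:
        for res in pool.imap(process_value, values, chunksize=64):
            pass
    print('Done')

On Linux, forked workers also inherit module-level globals set before the Pool is created (copy-on-write), which avoids even the per-worker reload.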
Source: https://stackoverflow.com/questions/62545562/multiprocessing-with-large-iterable