Question
I have two large Pandas DataFrames (1GB+) whose data needs to be processed by multiple workers. In a toy example with much smaller DataFrames (DFs) I can perform the operations without issue.
Below is my reproducible example.
I've tried several routes:
- Using chunksize: I am unable to take advantage of it. The DFs need to be sliced into specific pieces on an index before each piece is fed to the workers, and chunksize can only split them at arbitrary lengths.
- Using starmap: This is what you see in the code below. I pre-slice the DFs on the indexes and store the pieces in an iterable, so each piece can be passed as a small frame (or dict) to a worker process. This is not feasible at the size of my real DFs: the iterable never finishes being created. I've tried and failed to use a generator/yield with starmap; I would appreciate an example of a workaround if that is an option (a sketch of the idea appears after this list).
- Using imap: The entire input DFs end up going to each worker process. I was able to use a generator/yield through an intermediate function that slices the DFs and hands each piece to a worker without building a huge iterable, but it ended up slower than a plain for loop: the overhead of passing the data to the workers was the bottleneck.
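For reference, a minimal sketch of the generator-fed imap idea (the names worker and slice_pairs are made up for illustration; the 'Number' column matches my example below). The generator means the full iterable is never materialized, although each slice still gets pickled on its way to a worker:

import multiprocessing

import numpy as np
import pandas as pd

def worker(pair):
    # imap passes a single argument, so the two slices travel as one tuple.
    a, b = pair
    return len(a), len(b)

def slice_pairs(df1, df2):
    # groupby slices each frame once, instead of one boolean mask per value.
    groups2 = dict(tuple(df2.groupby('Number')))
    for value, a in df1.groupby('Number'):
        yield a, groups2.get(value, df2.iloc[0:0])

if __name__ == '__main__':
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        for res in pool.imap(worker, slice_pairs(df1, df2), chunksize=16):
            pass
    print('Done')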
I am ready to conclude that multiprocessing cannot be applied to a large data table that needs to be sliced before being sent to workers.
import random
import numpy as np
import pandas as pd
import multiprocessing

def df_slicer(df1, df2):
    while len(df1) > 0:
        # Will use first found number as the index for this process.
        value_to_slice_df_by = df1.iloc[0]['Number']
        a = df1[df1['Number'] == value_to_slice_df_by].copy()
        b = df2[df2['Number'] == value_to_slice_df_by].copy()
        print('len(df1): {}, len(df2): {}'.format(len(a), len(b)))
        return [a, b]

if __name__ == '__main__':
    # These are large and will be pulled in from Pandas pickles.
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])

    iterable = [[df1.loc[df1['Number'] == i], df2.loc[df2['Number'] == i]] for i in list(np.unique(df1['Number']))]

    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    for res in pool.starmap(df_slicer, [[i[0], i[1]] for i in iterable]):
        result = res
    print('Done')
    pool.close()
    pool.join()
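A sketch of the opposite strategy, which would sidestep the transfer overhead entirely: each worker loads the pickles itself through Pool's initializer, and only the slice keys cross the process boundary. The paths df1.pkl / df2.pkl and the helper names here are hypothetical, not part of my setup:

import multiprocessing

import pandas as pd

_df1 = None
_df2 = None

def init_worker(df1_path, df2_path):
    # Runs once in every worker process; each worker keeps its own copy.
    global _df1, _df2
    _df1 = pd.read_pickle(df1_path)
    _df2 = pd.read_pickle(df2_path)

def process_value(value):
    # Slice locally inside the worker; only `value` was pickled by the pool.
    a = _df1[_df1['Number'] == value]
    b = _df2[_df2['Number'] == value]
    return len(a), len(b)

if __name__ == '__main__':
    df1_path, df2_path = 'df1.pkl', 'df2.pkl'  # hypothetical pickle paths
    values = pd.read_pickle(df1_path)['Number'].unique()
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1,
                              initializer=init_worker,
                              initargs=(df1_path, df2_path)) as pool:
        for res in pool.imap(process_value, values, chunksize=64):
            pass
    print('Done')

On Linux, forked workers also inherit module-level globals set before the Pool is created (copy-on-write), which avoids even the per-worker reload.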
Source: https://stackoverflow.com/questions/62545562/multiprocessing-with-large-iterable