I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds).
A better implementation is to use pandas' own chunked reading (the chunksize argument of pd.read_csv), which returns an iterator of DataFrame chunks, and feed that iterator straight into the "pool.imap" function.
pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Benefit: It doesn't read the whole df into your main process (saves memory). Each child process is handed only the chunk it has to work on, which solves the child memory issue.
Overhead: It requires you to save your df as CSV first and read it back in with pd.read_csv, which costs I/O time.
Note: chunksize is not available in pd.read_pickle or in other loading methods for formats that are compressed on storage.
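To make the chunked reader concrete, here is a minimal sketch of what pd.read_csv hands back when chunksize is set. The tiny 10x5 frame and the in-memory buffer are stand-ins chosen for illustration only; a real job would write to a file on disk instead.

```python
import io

import numpy as np
import pandas as pd

# Small stand-in frame (assumption: the real one is far larger and lives on disk).
df = pd.DataFrame(np.arange(50).reshape(10, 5))

# Serialize to CSV; an in-memory buffer replaces the file for this demo.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

# With chunksize set, read_csv returns an iterator of DataFrames
# rather than one DataFrame.
reader = pd.read_csv(csv_buffer, chunksize=4)

# Each chunk holds at most `chunksize` rows; the last one may be shorter.
chunk_lengths = [len(chunk) for chunk in reader]
print(chunk_lengths)  # [4, 4, 2]
```

Only one chunk is parsed at a time, so the parent never has to hold the full frame.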
import multiprocessing as mp
import numpy as np
import pandas as pd

def main():
    # Job parameters
    n_jobs = 4  # Pool size
    size = (10000, 1000)  # Size of DataFrame
    chunksize = 100  # Maximum number of rows per chunk
    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel.
    # just_wait_and_print_len_and_idx is the worker function, defined elsewhere.
    df_chunked = pd.read_csv('.csv', chunksize=chunksize)  # modified: iterator of chunks
    pool.imap(just_wait_and_print_len_and_idx, df_chunked)  # modified: pass the reader directly
    pool.close()
    pool.join()
    print('DONE')
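If you also want the results back, a rough sketch along the following lines might help. The names chunk_summary and process_csv_in_chunks are hypothetical helpers invented for this example, and the demo.csv file, its size, and the temporary directory are assumptions as well. The pool is passed in as a parameter so the same function works with mp.Pool or any other object that offers imap.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np
import pandas as pd


def chunk_summary(chunk):
    """Hypothetical worker: return (first row label, number of rows)."""
    return int(chunk.index[0]), len(chunk)


def process_csv_in_chunks(path, worker, chunksize, pool):
    """Feed the chunks of the CSV at `path` through pool.imap, in order."""
    reader = pd.read_csv(path, chunksize=chunksize)  # iterator of DataFrames
    # imap iterates over the reader itself, so the parent never builds
    # the full frame; list() collects the per-chunk results in order.
    return list(pool.imap(worker, reader))


def main():
    n_jobs = 4  # Pool size
    chunksize = 100  # Maximum number of rows per chunk
    with tempfile.TemporaryDirectory() as tmp:
        # Demo data only; a real job would point at an existing CSV.
        path = os.path.join(tmp, "demo.csv")
        pd.DataFrame(np.random.rand(1000, 10)).to_csv(path, index=False)
        with mp.Pool(n_jobs) as pool:
            results = process_csv_in_chunks(path, chunk_summary, chunksize, pool)
    print(results[:3])
```

Call main() under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn worker processes. The worker has to live at module level so it can be pickled and sent to the children.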