I have written the program (below) to group a pandas DataFrame by a specific column value.
I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames of millions of rows loaded in memory. I think this issue is common, so I will keep my answer general.
A combination of settings solved the problem for me (points 1, 3 and 5 alone might do it for you):
1. Use `Pool.imap` (or `imap_unordered`) instead of `Pool.map`. This iterates over your data lazily rather than loading all of it into memory before processing starts.
2. Set a value for the `chunksize` parameter. This will make `imap` faster too.
3. Set a value for the `maxtasksperchild` parameter. This makes each worker process restart after completing that many tasks, releasing any memory it has accumulated.
4. Append output to disk rather than accumulating it in memory, either immediately or periodically once it reaches a certain size.
5. Run the code in batches. You can use `itertools.islice` if you have an iterator. The idea is to split your `list(gen_matrix_df_list.values())` into three or more lists, pass only the first third to `map` or `imap`, then the second third in another run, and so on. Since you have a list, you can simply slice it in the same line of code.