How to solve memory issues while multiprocessing using Pool.map()?

猫巷女王i 2020-12-12 15:41

I have written the program (below) to:

  • read a huge text file as a pandas DataFrame
  • then group it by the value of a specific column
4 Answers
  •  有刺的猬 2020-12-12 16:12

    I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames with millions of rows loaded in memory. I think this issue is common, so I will keep my answer general.

    A combination of settings solved the problem for me (points 1, 3 & 5 alone might be enough for you):

    1. Use Pool.imap (or imap_unordered) instead of Pool.map. This iterates over the data lazily rather than loading all of it into memory before processing starts. (A combined sketch of points 1-5 appears after this list.)

    2. Set the chunksize parameter. This also makes imap faster.

    3. Set the maxtasksperchild parameter, so worker processes are restarted after a fixed number of tasks and any memory they have accumulated is released.

    4. Append output to disk rather than keeping it in memory, either immediately or whenever it grows past a certain size.

    5. Run the code in separate batches. You can use itertools.islice if you have an iterator. The idea is to split list(gen_matrix_df_list.values()) into three or more lists, pass only the first third to map or imap, then the second third in another run, and so on. Since you have a list, you can simply slice it in the same line of code.
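
    Below is a minimal sketch combining points 1-5, assuming the heavy work happens once per group of a DataFrame. The worker function process_group, the column name, the file paths, and the pool/batch sizes are illustrative placeholders, not the asker's actual code.

        import itertools
        from multiprocessing import Pool

        import pandas as pd


        def process_group(group_df):
            # Placeholder for the real per-group computation; returns a small frame.
            return group_df.describe()


        def run_in_batches(groups, batch_size=1000, out_path="results.csv"):
            # groups: any iterator of DataFrames (e.g. one per groupby key).
            groups = iter(groups)
            first_write = True
            while True:
                # 5. Pull only the next batch instead of materialising the whole list.
                batch = list(itertools.islice(groups, batch_size))
                if not batch:
                    break
                # 3. maxtasksperchild recycles workers, releasing any memory they leak.
                with Pool(processes=4, maxtasksperchild=100) as pool:
                    # 1 & 2. imap_unordered consumes the batch lazily; chunksize
                    # controls how many items are sent to a worker at once.
                    for result in pool.imap_unordered(process_group, batch, chunksize=10):
                        # 4. Append each result to disk instead of accumulating it in RAM.
                        result.to_csv(out_path, mode="a", header=first_write)
                        first_write = False


        if __name__ == "__main__":
            df = pd.read_csv("huge_file.txt", sep="\t")         # placeholder input
            groups = (g for _, g in df.groupby("some_column"))  # lazy generator of groups
            run_in_batches(groups)

    Tune batch_size, chunksize and maxtasksperchild against your available RAM: smaller values lower the peak memory footprint at the cost of some speed.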
