How to solve memory issues while multiprocessing using Pool.map()?

猫巷女王i 2020-12-12 15:41

I have written the program (below) to:

  • read a huge text file as a pandas DataFrame
  • then group it by the value of a specific column
4 Answers
  •  有刺的猬 2020-12-12 16:12

    I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames with millions of rows loaded in memory. I think this issue is common, so I will keep my answer general.

    A combination of settings solved the problem for me (points 1, 3 & 5 alone might be enough for you):

    1. Use Pool.imap (or imap_unordered) instead of Pool.map. This iterates over the data lazily rather than loading all of it into memory before processing starts. (A combined sketch of points 1-5 appears after this list.)

    2. Set the chunksize parameter. This also makes imap faster.

    3. Set the maxtasksperchild parameter, so worker processes are restarted after a fixed number of tasks and any memory they have accumulated is released.

    4. Append output to disk rather than keeping it in memory, either immediately or whenever it grows past a certain size.

    5. Run the code in separate batches. You can use itertools.islice if you have an iterator. The idea is to split list(gen_matrix_df_list.values()) into three or more lists, pass only the first third to map or imap, then the second third in another run, and so on. Since you have a list, you can simply slice it in the same line of code.
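
    Below is a minimal sketch combining points 1-5, assuming the heavy work happens once per group of a DataFrame. The worker function process_group, the column name, the file paths, and the pool/batch sizes are illustrative placeholders, not the asker's actual code.

        import itertools
        from multiprocessing import Pool

        import pandas as pd


        def process_group(group_df):
            # Placeholder for the real per-group computation; returns a small frame.
            return group_df.describe()


        def run_in_batches(groups, batch_size=1000, out_path="results.csv"):
            # groups: any iterator of DataFrames (e.g. one per groupby key).
            groups = iter(groups)
            first_write = True
            while True:
                # 5. Pull only the next batch instead of materialising the whole list.
                batch = list(itertools.islice(groups, batch_size))
                if not batch:
                    break
                # 3. maxtasksperchild recycles workers, releasing any memory they leak.
                with Pool(processes=4, maxtasksperchild=100) as pool:
                    # 1 & 2. imap_unordered consumes the batch lazily; chunksize
                    # controls how many items are sent to a worker at once.
                    for result in pool.imap_unordered(process_group, batch, chunksize=10):
                        # 4. Append each result to disk instead of accumulating it in RAM.
                        result.to_csv(out_path, mode="a", header=first_write)
                        first_write = False


        if __name__ == "__main__":
            df = pd.read_csv("huge_file.txt", sep="\t")         # placeholder input
            groups = (g for _, g in df.groupby("some_column"))  # lazy generator of groups
            run_in_batches(groups)

    Tune batch_size, chunksize and maxtasksperchild against your available RAM: smaller values lower the peak memory footprint at the cost of some speed.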
