Pandas and Multiprocessing Memory Management: Splitting a DataFrame into Multiple Chunks

难免孤独 2021-02-06 06:24

I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds).

2 Answers
  •  死守一世寂寞
    2021-02-06 06:54

    Ok, so I figured it out after the hint by Sebastian Opałczyński in the comments.

    The problem is that the child processes are forked from the parent, so all of them hold a reference to the original DataFrame. However, the frame is modified in the original process, so copy-on-write gradually duplicates the affected memory pages, and memory use creeps up until the limit of physical memory is reached.

    There is a simple solution: Instead of pool = mp.Pool(n_jobs), I use the new context feature of multiprocessing:

    ctx = mp.get_context('spawn')
    pool = ctx.Pool(n_jobs)
    

    This guarantees that the Pool processes are just spawned and not forked from the parent process. Accordingly, none of them has access to the original DataFrame and all of them only need a tiny fraction of the parent's memory.

    Note that mp.get_context('spawn') is only available in Python 3.4 and newer.
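
For illustration, here is a minimal sketch of how the spawn context might be combined with chunked processing. The names iter_chunks, process_chunk, process_in_parallel and chunk_size are hypothetical and not part of the original code, and the row sum is only a placeholder for the real, lengthy row operation.

```python
import multiprocessing as mp

import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive row slices of df, each at most chunk_size rows long."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


def process_chunk(chunk):
    """Hypothetical worker: stands in for the lengthy per-row operation."""
    # Placeholder computation (assumption): sum the values of each row.
    return chunk.sum(axis=1)


def process_in_parallel(df, n_jobs, chunk_size):
    """Process df chunk by chunk in spawned (not forked) worker processes."""
    ctx = mp.get_context("spawn")
    with ctx.Pool(n_jobs) as pool:
        # Each chunk is pickled and sent to a worker, so a worker only ever
        # holds its own chunk; imap returns the results in input order.
        results = list(pool.imap(process_chunk, iter_chunks(df, chunk_size)))
    return pd.concat(results)
```

With the spawn start method, the worker function has to be defined at module top level so the child processes can import it, and the call to process_in_parallel should sit under an if __name__ == "__main__": guard in the main script.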
