Pandas and Multiprocessing Memory Management: Splitting a DataFrame into Multiple Chunks
Question: I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds). So I had the idea to split the frame into chunks and process each chunk in parallel using multiprocessing. This does speed up the task, but the memory consumption is a nightmare. Although each child process should in principle only consume a tiny chunk of the data, it needs (almost) as much memory as the original parent process.
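
For concreteness, the setup described above might look roughly like the following minimal sketch. The names `split_frame`, `process_row`, `process_chunk` and `process_parallel`, the columns `a` and `b`, and the per-row operation itself are all hypothetical stand-ins, not the original code.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def split_frame(df, n_chunks):
    """Split df into n_chunks row-wise pieces of nearly equal size."""
    # Integer row boundaries, e.g. 10 rows in 3 chunks -> 0, 3, 6, 10.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def process_row(row):
    # Hypothetical stand-in for the lengthy per-row operation.
    return row["a"] + row["b"]


def process_chunk(chunk):
    """Apply the per-row operation to every row of one chunk."""
    return chunk.apply(process_row, axis=1)


def process_parallel(df, n_workers=4):
    """Process the chunks of df in a pool of worker processes."""
    chunks = split_frame(df, n_workers)
    with mp.Pool(n_workers) as pool:
        # Each chunk is pickled and sent to a worker; results come back in order.
        results = pool.map(process_chunk, chunks)
    return pd.concat(results)
```

A script would call `process_parallel(df)` from inside an `if __name__ == "__main__":` guard, so that worker processes can import the module without re-running the pool setup.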