Sharing large pandas DataFrame with multiprocessing for loop in Python

Question

Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

import multiprocessing
import pandas
import pyodbc

def download(args):
    """pyodbc code to download data from the SQL database"""

def calc(dataset, index):
    filter_data = dataset[dataset['ID'] == index]  # keep only the rows for this ID
    # run calculations on the filtered DataFrame
    # append results to a local csv

if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = pandas.concat([data_1, data_2])  # Combine the downloaded DataFrames into one

    unique_id = pandas.unique(all_data['ID'])
    pool = multiprocessing.Pool()
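    results = [pool.apply_async(calc, args=(all_data, x)) for x in unique_id]
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all workers to finish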

Answer 1:

Q : "Sharing large pandas DataFrame with multiprocessing for loop in Python ?"

While the multiprocessing module does have tools for sharing some data, using them here would be an anti-pattern with respect to your stated goal, which is to gain performance by running the work inside a Pool instance in a "just"-[CONCURRENT] fashion.

Why?

You pay immense costs to move the filtering into a Pool of independent ( "just"-[CONCURRENT] ) workers, yet each of them then waits to be served, again, through the central GIL lock, which turns the Manager's work back into pure-[SERIAL] processing. Even worse, the work is RAM I/O-bound, so with no free access to RAM the performance gets suffocated, and the whole approach goes in a principally wrong direction.
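To get a feel for one of these costs, remember that Pool.apply_async pickles its arguments for every submitted task. The sketch below is only an illustration on toy data: the helper name transfer_cost_bytes, the demo frame and its 'value' column are made up for this example and are not part of the original code. It compares the bytes serialised when the whole DataFrame accompanies every ID with the bytes serialised when only that ID's rows are sent.

```python
import pickle

import numpy as np
import pandas as pd


def transfer_cost_bytes(frame, id_column="ID"):
    """Hypothetical helper: estimate the bytes pickled for one task per ID.

    Returns (whole_total, sliced_total):
      whole_total  - the full frame is pickled once per unique ID
      sliced_total - only the rows of each ID are pickled
    """
    n_ids = frame[id_column].nunique()
    whole_per_task = len(pickle.dumps(frame, protocol=pickle.HIGHEST_PROTOCOL))
    sliced_total = sum(
        len(pickle.dumps(group, protocol=pickle.HIGHEST_PROTOCOL))
        for _, group in frame.groupby(id_column)
    )
    return n_ids * whole_per_task, sliced_total


# Toy data (an assumption for illustration): 50 IDs with 200 rows each.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "ID": np.repeat(np.arange(50), 200),
    "value": rng.rand(50 * 200),
})

whole_bytes, sliced_bytes = transfer_cost_bytes(demo)
print("whole frame per task:", whole_bytes, "bytes in total")
print("one slice per task:  ", sliced_bytes, "bytes in total")
```

On a 7-million-row frame the first figure grows with the number of unique IDs times the full frame size, and that is before any worker has done a single calculation.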

THE ECONOMY OF ADD-ON COSTS vs. THE TRAP OF AMDAHL'S LAW :

The rate at which money gets burnt on add-on costs, which are not visible from a few SLOC-s, can be ( and often is ) far higher than any in-vivo performance benefit gained from executing several lines of code in a "just"-[CONCURRENT] ( and all the harder in a True-[PARALLEL] ) fashion. That benefit remains only potential until the solution has been well engineered, tuned and validated.
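A rough way to put numbers on this trade-off is an Amdahl-style estimate with an extra term for the add-on costs. The sketch below is a simplification under stated assumptions: the function name speedup and the single lumped overhead_fraction parameter are made up for this illustration, and every fraction is expressed relative to the original single-process run time.

```python
def speedup(parallel_fraction, workers, overhead_fraction=0.0):
    """Estimate speedup over a single-process run (illustrative model only).

    parallel_fraction - share of the original run time that can be spread
                        across workers (0.0 .. 1.0)
    workers           - number of worker processes (>= 1)
    overhead_fraction - add-on costs (process start-up, pickling, transfers),
                        as a share of the original run time
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if overhead_fraction < 0.0:
        raise ValueError("overhead_fraction must not be negative")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers + overhead_fraction)


# Classic Amdahl's law: 90 % parallelisable work on 4 workers, no overhead.
print(speedup(0.9, 4))                         # about 3.08

# Same work, but the add-on costs equal 80 % of the original run time.
print(speedup(0.9, 4, overhead_fraction=0.8))  # about 0.89, slower than serial
```

Once the overhead term is large enough, the estimate drops below 1.0, meaning the "parallel" version runs slower than the plain for loop it was supposed to replace.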

Source: https://stackoverflow.com/questions/35614952/sharing-large-pandas-dataframe-with-multiprocessing-for-loop-in-python

Tags

python

pandas

parallel-processing

multiprocessing