Is there a good way to avoid memory deep copy or to reduce time spent in multiprocessing?

时光取名叫无心 2021-02-06 10:47

I am building an in-memory, real-time calculation module for "big data" using the pandas library in Python.

So response time is the quality of this module and

2 answers
  •  孤街浪徒
    2021-02-06 11:25

    You will get better performance if you keep interprocess communication to a minimum. Therefore, instead of passing sub-DataFrames as arguments, just pass index values. The subprocess can slice the common DataFrame itself.
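
    To make the difference concrete, here is a rough sketch that compares how many bytes have to be pickled per task in each case. `chunk_bounds` and `payload_sizes` are hypothetical helpers written for this illustration (they are not part of pandas or multiprocessing), and the frame shape is arbitrary.

```python
import pickle

import numpy as np
import pandas as pd


def chunk_bounds(n_rows, n_chunks):
    """Split range(n_rows) into n_chunks contiguous (start, end) pairs."""
    edges = [n_rows * i // n_chunks for i in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


def payload_sizes(df, start, end):
    """Bytes pickled when sending the index pair vs. the sub-DataFrame."""
    as_bounds = len(pickle.dumps((start, end)))
    as_frame = len(pickle.dumps(df.iloc[start:end]))
    return as_bounds, as_frame


# Illustrative data only; the real frame would be far larger.
df = pd.DataFrame(np.zeros((1000, 4)),
                  columns=["column_01", "column_02", "column_03", "column_04"])
for start, end in chunk_bounds(len(df), 5):
    print(start, end, payload_sizes(df, start, end))
```

    The index pair stays the same small size no matter how many rows the chunk covers, while the pickled sub-DataFrame grows with the chunk.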

    When a subprocess is spawned, it gets a copy of all the globals defined in the calling module of the parent process. Thus, if the large DataFrame, df, is defined in the globals before you spawn a multiprocessing pool, then each spawned subprocess will have access to df.

    On Windows, where there is no fork(), a new python process is started and the calling module is imported. Thus, on Windows, the spawned subprocess has to regenerate df from scratch, which could take time and much additional memory.
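
    Where workers cannot inherit df, one common workaround is to build it once per worker with a pool initializer rather than once per task. The sketch below assumes the data can be regenerated or reloaded inside the worker; `load_frame`, `init_worker`, and `run` are hypothetical names, and the random data stands in for whatever loading step the real module uses.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

_df = None  # per-worker global, filled in by init_worker


def load_frame(n_rows, seed):
    """Stand-in for the real (expensive) loading step; assumed deterministic."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((n_rows, 2)),
                        columns=["column_01", "column_02"])


def init_worker(n_rows, seed):
    """Runs once in each worker process, so the load cost is paid per worker."""
    global _df
    _df = load_frame(n_rows, seed)


def compute(bounds):
    """Slice the worker-local frame; only the index pair crosses processes."""
    start, end = bounds
    sub = _df.iloc[start:end]
    return start, end, float(sub["column_01"].sum())


def run(n_rows=1000, seed=0, n_workers=2):
    # On Windows, call run() from inside an `if __name__ == "__main__":` guard.
    half = n_rows // 2
    with mp.Pool(n_workers, initializer=init_worker,
                 initargs=(n_rows, seed)) as pool:
        return pool.map(compute, [(0, half), (half, n_rows)])
```

    This does not remove the extra memory used by each worker's copy; it only avoids rebuilding or re-sending the data for every task.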

    On Linux, however, fork() gives you copy-on-write. This means that the subprocess reads the original globals (of the calling module) without copying them. Only when the subprocess tries to modify a global does Linux make a separate copy of the affected memory before the value is changed.
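
    Copy-on-write sharing only applies when workers are created with the fork start method, and the default start method differs between platforms and Python versions. As a minimal sketch, the start method can be requested explicitly; `pick_context` and `inherits_parent_globals` are hypothetical helpers, while `get_all_start_methods` and `get_context` are standard multiprocessing functions.

```python
import multiprocessing as mp


def inherits_parent_globals(method):
    """Only 'fork' clones the parent's memory; 'spawn' and 'forkserver' start fresh."""
    return method == "fork"


def pick_context():
    """Prefer fork where the platform offers it, else use the default context."""
    if "fork" in mp.get_all_start_methods():
        return mp.get_context("fork")
    return mp.get_context()


ctx = pick_context()
method = ctx.get_start_method()
print(method, "-> workers see parent's df:", inherits_parent_globals(method))
```

    A pool created with `ctx.Pool()` then uses that start method regardless of the interpreter's default.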

    So you can enjoy a performance gain if you avoid modifying globals in your subprocesses. I suggest using the subprocess only for computation. Return the value of the computation, and let the main process collate the results to modify the original DataFrame.

    import pandas as pd
    import numpy as np
    import multiprocessing as mp
    import time
    
    def compute(start, end):
        sub = df.iloc[start:end]
        return start, end, np.abs(sub['column_01']+sub['column_01']) / 2
    
    def collate(retval):
        start, end, arr = retval
        # .ix was removed in pandas 1.0; assign by the labels of the returned Series
        df.loc[arr.index, 'new_column'] = arr
    
    def window(seq, n=2):
        """
        Returns a sliding window (of width n) over data from the sequence
        s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
        """
        for i in range(len(seq)-n+1):
            yield tuple(seq[i:i+n])
    
    if __name__ == "__main__":
        result = []
        # the record count of the real data is over 1 billion with about 10 columns.
        N = 10**3
        df = pd.DataFrame(np.random.randn(N, 4),
                          columns=['column_01', 'column_02', 'column_03', 'column_04'])
    
        pool = mp.Pool()    
        df['new_column'] = np.empty(N, dtype='float')
    
        start_time = time.time()
        idx = np.linspace(0, N, 5+1).astype('int')
        for start, end in window(idx, 2):
            # print(start, end)
            pool.apply_async(compute, args=[start, end], callback=collate)
    
        pool.close()
        pool.join()
        print('elapsed time  : ', np.round(time.time() - start_time, 3))
        print(df.head())
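
    As an alternative to collating in a callback, the main process can gather all results first and place them by position. This is only a sketch under the assumption that every chunk returns `(start, end, values)` as `compute` does above; `assemble` is a hypothetical helper.

```python
import numpy as np


def assemble(n_rows, results):
    """Place (start, end, values) chunks into one array by position."""
    out = np.full(n_rows, np.nan)  # rows from missing chunks stay NaN
    for start, end, values in results:
        out[start:end] = np.asarray(values)
    return out


# Chunks may arrive in any order; position alone decides where they land.
results = [(2, 4, [20.0, 30.0]), (0, 2, [0.0, 10.0])]
print(assemble(4, results))
```

    With the pool from the example above, something like `df['new_column'] = assemble(len(df), pool.starmap(compute, bounds))` would then do a single assignment in the main process, where `bounds` is the list of `(start, end)` pairs.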
    
