I am building a memory-based, real-time calculation module for "big data" using the Pandas module in the Python environment, so response time is the critical quality of this module.
You will get better performance if you keep interprocess communication to a minimum. Therefore, instead of passing sub-DataFrames as arguments, just pass index values. The subprocess can slice the common DataFrame itself.
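For example, here is a minimal single-process sketch of the idea (the column name and slice bounds are just illustrative): the worker receives only two integers and slices the shared DataFrame itself, so no sub-DataFrame ever has to be pickled and sent between processes:

```python
import numpy as np
import pandas as pd

# Module-level DataFrame that workers will slice locally.
df = pd.DataFrame({'column_01': np.arange(10.0)})

def compute(start, end):
    # Only the index bounds cross the process boundary;
    # the slicing happens here, inside the worker.
    sub = df.iloc[start:end]
    return sub['column_01'].sum()
```

With a `multiprocessing.Pool`, you would call `pool.apply_async(compute, args=[start, end])` rather than passing `df.iloc[start:end]` itself.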
When a subprocess is spawned, it gets a copy of all the globals defined in the calling module of the parent process. Thus, if the large DataFrame, `df`, is defined in the globals before you spawn a multiprocessing pool, then each spawned subprocess will have access to `df`.
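Here is a minimal sketch of that behavior (the helper name `total` is mine, not from the code below): the worker reads the module-level `df` directly, and nothing is passed in as an argument:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

# Global defined in the calling module before the pool is created.
df = pd.DataFrame({'column_01': np.arange(4.0)})

def total(_):
    # The worker sees the module-level df without it being an argument.
    return df['column_01'].sum()

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(total, range(2)))  # both workers see df
```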
On Windows, where there is no `fork()`, a new Python process is started and the calling module is imported. Thus, on Windows, the spawned subprocess has to regenerate `df` from scratch, which could take time and much additional memory.
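If you must run on Windows, one possible mitigation (a sketch under my own naming, not part of the original code) is to pass the raw values once to the `Pool` initializer, so each worker builds its `df` a single time instead of re-running the whole calling module's setup:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def init_worker(values):
    # Runs once per spawned worker: rebuild the DataFrame from a plain
    # array so each process pays the construction cost only once.
    global df
    df = pd.DataFrame(values, columns=['column_01'])

def col_sum(_):
    return df['column_01'].sum()

if __name__ == "__main__":
    data = np.arange(4.0).reshape(-1, 1)
    with mp.Pool(2, initializer=init_worker, initargs=(data,)) as pool:
        print(pool.map(col_sum, range(2)))
```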
On Linux, however, you have copy-on-write. This means that the spawned subprocess accesses the original globals (of the calling module) without copying them. Only when the subprocess tries to modify the global does Linux then make a separate copy before the value is modified.
So you can enjoy a performance gain if you avoid modifying globals in your subprocesses. I suggest using the subprocess only for computation. Return the value of the computation, and let the main process collate the results to modify the original DataFrame.
import pandas as pd
import numpy as np
import multiprocessing as mp
import time

def compute(start, end):
    # Slice the shared DataFrame inside the worker; only the index
    # bounds are passed between processes. Return a plain NumPy array
    # so the callback can assign it positionally.
    sub = df.iloc[start:end]
    return start, end, (np.abs(sub['column_01'] + sub['column_01']) / 2).values

def collate(retval):
    # apply_async callbacks run in the main process, so it is safe
    # to modify df here.
    start, end, arr = retval
    df.iloc[start:end, df.columns.get_loc('new_column')] = arr

def window(seq, n=2):
    """
    Returns a sliding window (of width n) over data from the sequence:
    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

if __name__ == "__main__":
    # the record count of the real data is over 1 billion with about 10 columns.
    N = 10**3
    df = pd.DataFrame(np.random.randn(N, 4),
                      columns=['column_01', 'column_02', 'column_03', 'column_04'])
    df['new_column'] = np.empty(N, dtype='float')
    # Create the pool after df is fully set up, so forked workers see it.
    pool = mp.Pool()
    start_time = time.time()
    idx = np.linspace(0, N, 5 + 1).astype('int')
    for start, end in window(idx, 2):
        pool.apply_async(compute, args=[start, end], callback=collate)
    pool.close()
    pool.join()
    print('elapsed time :', np.round(time.time() - start_time, 3))
    print(df.head())