I am building a memory-based, real-time calculation module for "big data" using the Pandas module in the Python environment, so response time is the critical quality of this module.
You will get better performance if you keep interprocess communication to a minimum. Therefore, instead of passing sub-DataFrames as arguments, just pass index values. The subprocess can slice the common DataFrame itself.
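For example, here is a minimal single-process sketch of the idea (the column name and slice bounds are just illustrative): the worker receives only two integers and slices the shared DataFrame itself, so no sub-DataFrame ever has to be pickled and sent between processes:

```python
import numpy as np
import pandas as pd

# Module-level DataFrame that workers will slice locally.
df = pd.DataFrame({'column_01': np.arange(10.0)})

def compute(start, end):
    # Only the index bounds cross the process boundary;
    # the slicing happens here, inside the worker.
    sub = df.iloc[start:end]
    return sub['column_01'].sum()
```

With a `multiprocessing.Pool`, you would call `pool.apply_async(compute, args=[start, end])` rather than passing `df.iloc[start:end]` itself.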
When a subprocess is spawned, it gets a copy of all the globals defined in the calling module of the parent process. Thus, if the large DataFrame, `df`, is defined in the globals before you spawn a multiprocessing pool, then each spawned subprocess will have access to `df`.
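Here is a minimal sketch of that behavior (the helper name `total` is mine, not from the code below): the worker reads the module-level `df` directly, and nothing is passed in as an argument:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

# Global defined in the calling module before the pool is created.
df = pd.DataFrame({'column_01': np.arange(4.0)})

def total(_):
    # The worker sees the module-level df without it being an argument.
    return df['column_01'].sum()

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(total, range(2)))  # both workers see df
```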
On Windows, where there is no `fork()`, a new Python process is started and the calling module is imported. Thus, on Windows, the spawned subprocess has to regenerate `df` from scratch, which could take time and much additional memory.
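If you must run on Windows, one possible mitigation (a sketch under my own naming, not part of the original code) is to pass the raw values once to the `Pool` initializer, so each worker builds its `df` a single time instead of re-running the whole calling module's setup:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def init_worker(values):
    # Runs once per spawned worker: rebuild the DataFrame from a plain
    # array so each process pays the construction cost only once.
    global df
    df = pd.DataFrame(values, columns=['column_01'])

def col_sum(_):
    return df['column_01'].sum()

if __name__ == "__main__":
    data = np.arange(4.0).reshape(-1, 1)
    with mp.Pool(2, initializer=init_worker, initargs=(data,)) as pool:
        print(pool.map(col_sum, range(2)))
```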
On Linux, however, you have copy-on-write. This means that the spawned subprocess accesses the original globals (of the calling module) without copying them. Only when the subprocess tries to modify the global does Linux then make a separate copy before the value is modified.
So you can enjoy a performance gain if you avoid modifying globals in your subprocesses. I suggest using the subprocess only for computation. Return the value of the computation, and let the main process collate the results to modify the original DataFrame.
import pandas as pd
import numpy as np
import multiprocessing as mp
import time

def compute(start, end):
    # Slice the shared DataFrame inside the worker; only the index
    # bounds are passed between processes. Return a plain NumPy array
    # so the callback can assign it positionally.
    sub = df.iloc[start:end]
    return start, end, (np.abs(sub['column_01'] + sub['column_01']) / 2).values

def collate(retval):
    # apply_async callbacks run in the main process, so it is safe
    # to modify df here.
    start, end, arr = retval
    df.iloc[start:end, df.columns.get_loc('new_column')] = arr

def window(seq, n=2):
    """
    Returns a sliding window (of width n) over data from the sequence:
    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

if __name__ == "__main__":
    # the record count of the real data is over 1 billion with about 10 columns.
    N = 10**3
    df = pd.DataFrame(np.random.randn(N, 4),
                      columns=['column_01', 'column_02', 'column_03', 'column_04'])
    df['new_column'] = np.empty(N, dtype='float')
    # Create the pool after df is fully set up, so forked workers see it.
    pool = mp.Pool()
    start_time = time.time()
    idx = np.linspace(0, N, 5 + 1).astype('int')
    for start, end in window(idx, 2):
        pool.apply_async(compute, args=[start, end], callback=collate)
    pool.close()
    pool.join()
    print('elapsed time :', np.round(time.time() - start_time, 3))
    print(df.head())