Question
I've got the following function that compares the rows of two dataframes (data and ref) and returns the index of the matching ref rows when there's a match.
def get_gene(row):
    # Keep the ref rows whose column 0 equals row[0] and whose [col 2, col 3] interval contains [row[2], row[3]]
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
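To illustrate what the match does, here is a self-contained toy example; the column names and values are made up, and only the column positions matter (column 0 is compared for equality, columns 2 and 3 for interval containment):

import numpy as np
import pandas as pd

# Made-up frames for illustration; the real data/ref are 1.6M and 20K rows
ref = pd.DataFrame({'Chr': [1, 1], 'Gene': ['A', 'B'],
                    'Start': [100, 500], 'End': [200, 600]})
data = pd.DataFrame({'Chr': [1, 1], 'Name': ['x', 'y'],
                     'Start': [120, 700], 'End': [180, 800]})

def get_gene(row):
    # row.iloc[n] is the explicit positional form of the question's row[n]
    m = (np.equal(row.iloc[0], ref.iloc[:, 0].values)
         & np.greater_equal(row.iloc[2], ref.iloc[:, 2].values)
         & np.less_equal(row.iloc[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None

print(data.apply(get_gene, axis=1))
# row 'x' (120-180) lies inside gene A (100-200); row 'y' matches nothing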
Since this is a slow process (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used this piece of code, which I found on SO, and it worked fine with my function get_gene.
def _apply_df(args):
    # Unpack the single tuple argument, then apply func across df
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # 'workers' sets the pool size; the remaining kwargs are forwarded to df.apply
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    # Split df into one chunk per worker and process the chunks in parallel
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    df = pd.concat(list(result))
    return df
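For reference, since get_gene operates on a single row, the helper was presumably called with axis=1 forwarded to df.apply; the exact call isn't shown in the question, so the line below is an assumption:

result = apply_by_multiprocessing(data, get_gene, axis=1, workers=4)  # hypothetical call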
It allowed me to go down to 9 min of computation. But if I understood correctly, this code just splits my dataframe data into 4 pieces and sends each one to a core of the CPU. Hence, each core ends up comparing 400K rows (data split in 4) against the full 20K rows of ref.
What I would actually like to do is split both dataframes based on a value in one of their columns, so that I only compute comparisons between dataframes of the same 'group':

data.get_group('a') versus ref.get_group('a')
data.get_group('b') versus ref.get_group('b')
data.get_group('c') versus ref.get_group('c')
etc.
This would reduce the amount of computation: each row in data would only have to be matched against ~3K rows in ref instead of all 20K rows. A minimal sketch of this per-group pairing follows.
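The sketch below is self-contained, with made-up toy frames and a grouping column named 'Chr' as in the code further down:

import pandas as pd

# Toy frames; the real data/ref are 1.6M and 20K rows
data = pd.DataFrame({'Chr': [1, 1, 2], 'val': [10, 20, 30]})
ref = pd.DataFrame({'Chr': [1, 2], 'val': [0, 0]})

data_g, ref_g = data.groupby('Chr'), ref.groupby('Chr')

# Pair each per-group chunk of data with the ref chunk that shares its key
for key in data_g.groups:
    d, r = data_g.get_group(key), ref_g.get_group(key)
    print(key, ':', len(d), 'data rows vs', len(r), 'ref rows')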
I therefore tried to modify my multiprocessing code accordingly, but I couldn't manage to make it work:
def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])
    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)
    pool.close()
    pool.join()
    return pd.concat(results)

def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
I'm pretty sure it has to do with the way *args and **kwargs are passed through the different functions (because in this case I have to take into account that I want to pass my split ref dataframe along with the split data dataframe...).
I think the problem lies within the function _apply_df. I thought I understood what it really does, but the line df, func, kwargs = args is still bugging me, and I think I failed to modify it correctly.
Any advice is appreciated!
Answer 1:
Take a look at starmap():
starmap(func, iterable[, chunksize])
    Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments. Hence an iterable of [(1, 2), (3, 4)] results in [func(1, 2), func(3, 4)].
Which seems to be exactly what you need.
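For concreteness, here is a minimal, self-contained sketch of Pool.starmap() with a toy function (not the question's get_gene):

import multiprocessing

def add(x, y):
    return x + y

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        # Each tuple is unpacked into add's positional arguments:
        # [(1, 2), (3, 4)] -> [add(1, 2), add(3, 4)]
        print(pool.starmap(add, [(1, 2), (3, 4)]))  # prints [3, 7]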
Answer 2:
I'm posting the answer I came up with, for readers who might stumble upon this post. As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it only applies my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.
import multiprocessing

import numpy as np
import pandas as pd

def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    # Pair each per-chromosome chunk of data with the matching chunk of ref
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]
    # starmap unpacks each (data_chunk, ref_chunk) tuple into apply_get_gene's two arguments
    results = pool.starmap(apply_get_gene, args)
    pool.close()
    pool.join()
    return pd.concat(results)

def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
It now takes ~5min instead of ~9min with the former version of the code and ~25min without multiprocessing.
Source: https://stackoverflow.com/questions/51948034/parallelizing-comparisons-between-two-dataframes-with-multiprocessing