Question
I've got the following function that compares the rows of two dataframes (data and ref) and returns the index of the matching ref rows when there's a match.
def get_gene(row):
    # Keep the ref rows whose column 0 equals row[0] and whose [col 2, col 3] interval contains [row[2], row[3]]
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
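To illustrate what the match does, here is a self-contained toy example; the column names and values are made up, and only the column positions matter (column 0 is compared for equality, columns 2 and 3 for interval containment):

import numpy as np
import pandas as pd

# Made-up frames for illustration; the real data/ref are 1.6M and 20K rows
ref = pd.DataFrame({'Chr': [1, 1], 'Gene': ['A', 'B'],
                    'Start': [100, 500], 'End': [200, 600]})
data = pd.DataFrame({'Chr': [1, 1], 'Name': ['x', 'y'],
                     'Start': [120, 700], 'End': [180, 800]})

def get_gene(row):
    # row.iloc[n] is the explicit positional form of the question's row[n]
    m = (np.equal(row.iloc[0], ref.iloc[:, 0].values)
         & np.greater_equal(row.iloc[2], ref.iloc[:, 2].values)
         & np.less_equal(row.iloc[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None

print(data.apply(get_gene, axis=1))
# row 'x' (120-180) lies inside gene A (100-200); row 'y' matches nothing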
Since this is a slow process (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used this piece of code, which I found on SO, and it worked fine with my function get_gene.
def _apply_df(args):
    # Unpack the single tuple argument, then apply func across df
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    # 'workers' sets the pool size; the remaining kwargs are forwarded to df.apply
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    # Split df into one chunk per worker and process the chunks in parallel
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    df = pd.concat(list(result))
    return df
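For reference, since get_gene operates on a single row, the helper was presumably called with axis=1 forwarded to df.apply; the exact call isn't shown in the question, so the line below is an assumption:

result = apply_by_multiprocessing(data, get_gene, axis=1, workers=4)  # hypothetical call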
It allowed me to go down to 9 min of computation. But if I understood correctly, this code just splits my dataframe data into 4 pieces and sends each one to a core of the CPU. Hence, each core ends up comparing 400K rows (data split in 4) against the full 20K rows of ref.
What I would actually like to do is split both dataframes based on a value in one of their columns, so that I only compute comparisons between dataframes of the same 'group':

data.get_group('a') versus ref.get_group('a')
data.get_group('b') versus ref.get_group('b')
data.get_group('c') versus ref.get_group('c')
etc.
This would reduce the amount of computation: each row in data would only have to be matched against ~3K rows in ref instead of all 20K rows. A minimal sketch of this per-group pairing follows.
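The sketch below is self-contained, with made-up toy frames and a grouping column named 'Chr' as in the code further down:

import pandas as pd

# Toy frames; the real data/ref are 1.6M and 20K rows
data = pd.DataFrame({'Chr': [1, 1, 2], 'val': [10, 20, 30]})
ref = pd.DataFrame({'Chr': [1, 2], 'val': [0, 0]})

data_g, ref_g = data.groupby('Chr'), ref.groupby('Chr')

# Pair each per-group chunk of data with the ref chunk that shares its key
for key in data_g.groups:
    d, r = data_g.get_group(key), ref_g.get_group(key)
    print(key, ':', len(d), 'data rows vs', len(r), 'ref rows')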
I therefore tried to modify my multiprocessing code accordingly, but I couldn't manage to make it work:
def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])
    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)
    pool.close()
    pool.join()
    return pd.concat(results)

def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
I'm pretty sure it has to do with the way *args and **kwargs are passed through the different functions (because in this case I have to take into account that I want to pass my split ref dataframe along with the split data dataframe...).
I think the problem lies within the function _apply_df. I thought I understood what it really does, but the line df, func, kwargs = args is still bugging me, and I think I failed to modify it correctly.
Any advice is appreciated!
Answer 1:
Take a look at starmap():
starmap(func, iterable[, chunksize])
    Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments. Hence an iterable of [(1, 2), (3, 4)] results in [func(1, 2), func(3, 4)].
Which seems to be exactly what you need.
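For concreteness, here is a minimal, self-contained sketch of Pool.starmap() with a toy function (not the question's get_gene):

import multiprocessing

def add(x, y):
    return x + y

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        # Each tuple is unpacked into add's positional arguments:
        # [(1, 2), (3, 4)] -> [add(1, 2), add(3, 4)]
        print(pool.starmap(add, [(1, 2), (3, 4)]))  # prints [3, 7]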
Answer 2:
I'm posting the answer I came up with, for readers who might stumble upon this post. As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it only applies my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.
import multiprocessing

import numpy as np
import pandas as pd

def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    # Pair each per-chromosome chunk of data with the matching chunk of ref
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]
    # starmap unpacks each (data_chunk, ref_chunk) tuple into apply_get_gene's two arguments
    results = pool.starmap(apply_get_gene, args)
    pool.close()
    pool.join()
    return pd.concat(results)

def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
It now takes ~5min instead of ~9min with the former version of the code and ~25min without multiprocessing.
Source: https://stackoverflow.com/questions/51948034/parallelizing-comparisons-between-two-dataframes-with-multiprocessing