Parallel apply function on df in Python

Question

I have a function that goes over two lists: items and dates. The function returns an updated list of items. For now it runs with apply, which is not that efficient on millions of rows. I want to make it more efficient by parallelizing it.

Items in the item list are in chronological order, as are the corresponding dates in the date list (item_list and date_list are the same size).

This is the df:

Date        item_list            date_list

12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20 ]
11/05/20    [I1,I3]       [11/05/20 , 14/05/20]

This is the df that I want:

Date        item_list     date_list             items_list_per_date  

12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20]   [I1,I3]
11/05/20    [I1,I3]       [11/05/20 , 14/05/20]               nan

This is my code:

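import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers DataFrame.progress_apply

def get_item_list_per_date(date, items_list, date_list):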

    if str(items_list)=="nan" or str(date_list)=="nan":
        return np.nan

    new_date_list = []
    for d in list(date_list):
        new_date_list.append(pd.to_datetime(d))

    if (date in new_date_list) and (len(new_date_list)>1):
        loc = new_date_list.index(date)
    else:
        return np.nan

    updated_items_list = items_list[:loc]

    if len(updated_items_list )==0:
        return np.nan

    return updated_items_list 

# Column names match the df shown above ('Date', 'item_list', 'date_list')
df['items_list_per_date'] = df.progress_apply(lambda x: get_item_list_per_date(date=x['Date'], items_list=x['item_list'], date_list=x['date_list']), axis=1)

I would love to parallelize it if possible. Can you help?
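For reference, the rule described in the question can be sketched on plain lists, without pandas. This is only an illustration of the expected output for the two sample rows; the name `items_before` is hypothetical, and dates are compared as plain strings here, whereas the question's code converts them with `pd.to_datetime`.

```python
import math


def items_before(date, item_list, date_list):
    """Return the items at positions before `date` in date_list, or NaN if none."""
    if date not in date_list:
        return math.nan
    loc = date_list.index(date)   # position of the row's own date
    earlier = item_list[:loc]     # items that come before that position
    return earlier if earlier else math.nan


row1 = items_before("12/05/20", ["I1", "I3", "I4"], ["10/05/20", "11/05/20", "12/05/20"])
row2 = items_before("11/05/20", ["I1", "I3"], ["11/05/20", "14/05/20"])
print(row1)  # ['I1', 'I3']
print(row2)  # nan
```

The second row gives nan because its Date is the first entry of its date_list, so no item precedes it.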

Answer 1:

Use:

import multiprocessing as mp

import numpy as np
import pandas as pd

def fx(df):
    def __fx(s):
        date = s['Date']
        date_list = s['date_list']
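        if date in date_list:
            loc = date_list.index(date)
            items = s['item_list'][:loc]
            # An empty slice means no item precedes Date, so return NaN as in the expected output
            return items if items else np.nan
        else:
            return np.nan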

    return df.apply(__fx, axis=1)

def parallel_apply(df):
    dfs = filter(lambda d: not d.empty, np.array_split(df, mp.cpu_count()))
    pool = mp.Pool()
    per_date = pd.concat(pool.map(fx, dfs))
    pool.close()
    pool.join()
    return per_date

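# Guard the entry point: needed on platforms that spawn worker processes (e.g. Windows, macOS)
if __name__ == '__main__':
    df['items_list_per_date'] = parallel_apply(df)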

Result:

#print(df)

Date        item_list     date_list             items_list_per_date  

12/05/20    [I1,I3,I4]    [10/05/20, 11/05/20, 12/05/20]   [I1,I3]
11/05/20    [I1,I3]       [11/05/20 , 14/05/20]               nan
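A self-contained variant of the same idea is sketched below, under a few assumptions. The names `items_before_date`, `apply_chunk`, `split_frame`, `parallel_apply` and `sample_frame` are hypothetical, dates are kept as strings as in the sample table, and the frame is chunked with `iloc` slices instead of `np.array_split`, since recent pandas releases may emit a deprecation warning when `np.array_split` is given a DataFrame. It also returns NaN for rows whose `date_list` is missing, which the question's original function checks for.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def items_before_date(row):
    """Items listed before the row's own Date, or NaN when there are none."""
    date, date_list = row["Date"], row["date_list"]
    # A missing list (NaN) or a Date absent from date_list yields NaN.
    if not isinstance(date_list, list) or date not in date_list:
        return np.nan
    earlier = row["item_list"][: date_list.index(date)]
    return earlier if earlier else np.nan


def apply_chunk(chunk):
    """Worker entry point: ordinary row-wise apply on one chunk."""
    return chunk.apply(items_before_date, axis=1)


def split_frame(df, n_chunks):
    """Split df into at most n_chunks contiguous, non-empty pieces."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


def parallel_apply(df, n_workers=None):
    """Run apply_chunk on each chunk in a process pool and stitch the results."""
    if df.empty:
        return pd.Series(dtype=object)
    n_workers = n_workers or mp.cpu_count()
    with mp.Pool(n_workers) as pool:
        parts = pool.map(apply_chunk, split_frame(df, n_workers))
    return pd.concat(parts)  # original index is preserved, so assignment aligns


def sample_frame():
    """The two sample rows from the question, with dates as strings."""
    return pd.DataFrame({
        "Date": ["12/05/20", "11/05/20"],
        "item_list": [["I1", "I3", "I4"], ["I1", "I3"]],
        "date_list": [["10/05/20", "11/05/20", "12/05/20"], ["11/05/20", "14/05/20"]],
    })
```

It would be called as `df['items_list_per_date'] = parallel_apply(df)` from inside an `if __name__ == '__main__':` block. Whether this beats a plain apply depends on the data: each chunk is pickled and sent to a worker process, so the gain is likely to show only on large frames.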

Source: https://stackoverflow.com/questions/62129477/paralle-apply-function-on-df-in-python

Tags

python

parallel-processing

processing-efficiency