I'm trying to use multiprocessing with a pandas dataframe, that is, split the dataframe into 8 parts and apply some function to each part using apply (with each part processed in a different process).
I also ran into the same problem when I used multiprocessing.map() to apply a function to different chunks of a large dataframe.
I just want to add a few points in case other people run into the same problem as I did.
1. Remember to add if __name__ == '__main__': around the code that starts the worker processes.
2. Run the code from a .py file; if you use IPython/Jupyter notebook, you cannot run multiprocessing (this was true in my case, though I have no clue why).
3. I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map, which took 316 microseconds for some code that took 12 seconds in serial; a minimal sketch of that setup follows.
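A minimal sketch of that setup, assuming a top-level process_chunk function; the function body, data, and worker count are placeholders and not from the original post:

import concurrent.futures
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # placeholder: apply whatever per-row logic you need to this chunk
    return chunk.apply(lambda row: row.sum(), axis=1)

if __name__ == '__main__':
    big_df = pd.DataFrame(np.random.rand(100000, 4))
    chunks = np.array_split(big_df, 8)
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        # executor.map keeps results in the same order as the input chunks
        results = list(executor.map(process_chunk, chunks))
    combined = pd.concat(results)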
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
import multiprocessing as mp
import numpy as np

p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df, 8))
p.close()
p.join()

results = []
for result in pool_results:
    results.extend(result)
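For this to work, process must be a top-level (importable) function that takes one dataframe chunk and returns something list-like, so that extend makes sense. A hypothetical example, since the original function isn't shown:

def process(df_chunk):
    # example only: compute one value per row of the chunk
    return [row.sum() for _, row in df_chunk.iterrows()]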
A more generic version, based on the author's solution, that lets you run it with any function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
import pandas as pd

def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
will become:
parallelize_on_rows(df, some_func)
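If the row function needs extra arguments besides the row itself, you can pre-bind them with functools.partial before handing it to parallelize_on_rows; the function and parameter below are hypothetical, just for illustration:

from functools import partial

def some_func(row, threshold):
    # hypothetical example: flag rows whose sum exceeds a threshold
    return row.sum() > threshold

result = parallelize_on_rows(df, partial(some_func, threshold=10))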
You can use pandarallel (https://github.com/nalepae/pandarallel), as in the following example:
from pandarallel import pandarallel
from math import sin

pandarallel.initialize()

def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
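If you want a progress bar or a specific worker count, initialize accepts options for both; check the pandarallel docs for your installed version, for example:

pandarallel.initialize(nb_workers=4, progress_bar=True)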
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the number of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply

mapply.init(n_workers=-1)

def process_apply(x):
    # do some stuff to the data here (placeholder body)
    return x

def process(df):
    # spawns a pathos.multiprocessing.ProcessPool if sensible
    res = df.mapply(process_apply, axis=1)
    return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the number of physical CPUs would show up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that, this way, the CPU-bound workers will be fighting over the physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
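Either count can then be passed to the same mapply.init call shown above, for example:

mapply.init(n_workers=n_workers)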