Make Pandas DataFrame apply() use all cores?


As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time. How can you use all your cores to run apply on a DataFrame in parallel?

6 Answers
  • 2020-11-27 10:25

    Here is an example of an sklearn base transformer in which the pandas apply is parallelized:

    import multiprocessing as mp

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin

    class ParallelTransformer(BaseEstimator, TransformerMixin):
        def __init__(self, n_jobs=1):
            """
            n_jobs - number of parallel jobs to run
                     (-1 uses all cores, 0 or 1 runs sequentially)
            """
            self.n_jobs = n_jobs
        def fit(self, X, y=None):
            return self
        def transform(self, X, *_):
            X_copy = X.copy()
            cores = mp.cpu_count()

            if self.n_jobs <= -1:
                partitions = cores
            elif self.n_jobs == 0:
                partitions = 1
            else:
                partitions = min(self.n_jobs, cores)

            if partitions == 1:
                # transform sequentially
                return X_copy.apply(self._transform_one)

            # split the data into one batch per partition
            data_split = np.array_split(X_copy, partitions)

            pool = mp.Pool(partitions)

            # reduce step: concatenate the transformed batches
            data = pd.concat(
                pool.map(self._transform_part, data_split)
            )

            pool.close()
            pool.join()
            return data
        def _transform_part(self, df_part):
            return df_part.apply(self._transform_one)
        def _transform_one(self, line):
            # some kind of transformation here
            return line


    For more info, see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
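
    A minimal usage sketch (the demo DataFrame below is an assumption; since _transform_one is the identity here, the parallel transform should round-trip the input). Note the main guard, which is required on platforms that spawn worker processes:

    import numpy as np
    import pandas as pd

    if __name__ == "__main__":
        # hypothetical demo data; any DataFrame works
        df = pd.DataFrame({"a": np.arange(8), "b": np.arange(8) * 2})

        transformer = ParallelTransformer(n_jobs=-1)  # -1 -> use all cores
        result = transformer.fit(df).transform(df)
        print(result.equals(df))  # True: the identity transform round-trips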

  • 2020-11-27 10:26

    You can try pandarallel instead: a simple and efficient tool to parallelize your pandas operations across all your CPUs (on Linux & macOS).

    • Parallelization has a cost (instantiating new processes, sending data via shared memory, etc.), so parallelization is only efficient if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
    • Functions applied should NOT be lambda functions.
    from pandarallel import pandarallel
    from math import sin
    
    pandarallel.initialize()
    
    # FORBIDDEN
    df.parallel_apply(lambda x: sin(x**2), axis=1)
    
    # ALLOWED
    def func(x):
        return sin(x**2)
    
    df.parallel_apply(func, axis=1)
    

    See https://github.com/nalepae/pandarallel
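
    A slightly fuller sketch with explicit options (the nb_workers and progress_bar keywords reflect my reading of the pandarallel README; verify them against your installed version):

    import pandas as pd
    from math import sin
    from pandarallel import pandarallel

    # nb_workers defaults to the number of available CPUs;
    # progress_bar displays one progress bar per worker
    pandarallel.initialize(nb_workers=4, progress_bar=True)

    df = pd.DataFrame({"x": range(100_000)})

    def func(row):
        # a named (picklable) function, as required above
        return sin(row.x ** 2)

    df["y"] = df.parallel_apply(func, axis=1)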

  • 2020-11-27 10:28

    The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):

    import pandas as pd
    import dask.dataframe as dd
    from dask.multiprocessing import get
    

    and the syntax is

    data = <your_pandas_dataframe>
    ddata = dd.from_pandas(data, npartitions=30)
    
    def myfunc(x,y,z, ...): return <whatever>
    
    res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)  
    

    (I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):

    import timeit

    import numpy as np

    data = pd.DataFrame()
    data['col1'] = np.random.normal(size=1500000)
    data['col2'] = np.random.normal(size=1500000)

    ddata = dd.from_pandas(data, npartitions=30)

    def myfunc(x, y): return y * (x**2 + 1)
    def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
    def pandas_apply(): return apply_myfunc_to_DF(data)
    def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
    def vectorized(): return myfunc(data['col1'], data['col2'])
    
    t_pds = timeit.Timer(lambda: pandas_apply())
    print(t_pds.timeit(number=1))
    

    28.16970546543598

    t_dsk = timeit.Timer(lambda: dask_apply())
    print(t_dsk.timeit(number=1))
    

    2.708152851089835

    t_vec = timeit.Timer(lambda: vectorized())
    print(t_vec.timeit(number=1))
    

    0.010668013244867325

    That is a factor-of-ten speedup going from a pandas apply to a Dask apply on partitions. Of course, if you have a function you can vectorize, you should vectorize it - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
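
    For reference, newer Dask releases removed the get= keyword; under the current API (an assumption about your installed Dask version), the same call looks like this:

    import dask.dataframe as dd

    ddata = dd.from_pandas(data, npartitions=30)

    # scheduler="processes" replaces the removed get=get
    res = (
        ddata.map_partitions(lambda df: df.apply(lambda row: myfunc(*row), axis=1))
             .compute(scheduler="processes")
    )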

  • 2020-11-27 10:33

    To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.

    You can set the number of cores (and the chunking behaviour) upon init:

    import pandas as pd
    import mapply
    
    mapply.init(n_workers=-1)
    
    ...
    
    df.mapply(myfunc, axis=1)
    

    By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (so roughly twice the number of physical CPUs shows up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.

    Depending on your definition of all your cores, you could also use all logical cores instead (beware that this way the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):

    import multiprocessing
    n_workers = multiprocessing.cpu_count()
    
    # or more explicit
    import psutil
    n_workers = psutil.cpu_count(logical=True)
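
    A fuller init sketch with chunking options (the chunk_size, max_chunks_per_worker and progressbar keyword names are from my reading of the mapply README; verify them against your installed version):

    import pandas as pd
    import mapply

    # chunk_size caps how small a chunk may get;
    # max_chunks_per_worker bounds how many chunks each worker receives
    mapply.init(
        n_workers=-1,
        chunk_size=100,
        max_chunks_per_worker=8,
        progressbar=False,
    )

    def square(row):
        return row.a ** 2

    df = pd.DataFrame({"a": range(100_000)})
    out = df.mapply(square, axis=1)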
    
  • 2020-11-27 10:36

    You may use the swifter package:

    pip install swifter
    

    It works as a plugin for pandas, allowing you to reuse the apply function:

    import swifter
    
    def some_function(data):
        return data * 10
    
    data['out'] = data['in'].swifter.apply(some_function)
    

    It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.

    More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.

    Also note that this will not work automatically for string columns. When using strings, Swifter will fall back to a "simple" Pandas apply, which will not be parallel. In this case, even forcing it to use Dask will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing, as sketched below.
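
    A minimal sketch of that manual split-and-pool approach for a string column (the column names and the lower-casing transform are illustrative assumptions):

    import multiprocessing as mp

    import numpy as np
    import pandas as pd

    def clean_text(part: pd.Series) -> pd.Series:
        # plain pandas apply inside each worker process
        return part.apply(str.lower)

    if __name__ == "__main__":
        df = pd.DataFrame({"text": ["Foo", "BAR", "Baz"] * 1000})
        # one chunk per core; pd.concat realigns on the original index
        parts = np.array_split(df["text"], mp.cpu_count())
        with mp.Pool(mp.cpu_count()) as pool:
            df["text_clean"] = pd.concat(pool.map(clean_text, parts))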

  • 2020-11-27 10:40

    If you want to stay in native Python:

    import multiprocessing as mp
    
    with mp.Pool(mp.cpu_count()) as pool:
        df['newcol'] = pool.map(f, df['col'])
    

    This will apply the function f in parallel to the column col of the dataframe df.
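
    Note that f must be a module-level (picklable) function, and on platforms that spawn rather than fork worker processes (Windows, recent macOS) the pool must be created under a main guard. A complete sketch with an assumed toy f:

    import multiprocessing as mp

    import pandas as pd

    def f(x):
        # module-level so it can be pickled for the worker processes
        return x * 10

    if __name__ == "__main__":
        df = pd.DataFrame({"col": range(1_000)})
        with mp.Pool(mp.cpu_count()) as pool:
            df["newcol"] = pool.map(f, df["col"])
        print(df.head())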
