Why is apply sometimes not faster than a for-loop on a pandas DataFrame?

死守一世寂寞 · 2020-11-27 19:46

It seems apply can accelerate operations on a DataFrame in most cases, but when I use apply I don't see any speedup. Here comes my example.

1 Answer
  • 2020-11-27 20:21

    It is my understanding that .apply is not generally faster than iterating over the axis yourself. I believe that under the hood it is merely a loop over the axis, except that here you also incur the overhead of a Python function call on every iteration.
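
    For example, a quick timing comparison along the following lines (a minimal sketch: the DataFrame, its column names and row_func are invented for illustration, and the exact numbers will vary with machine and pandas version) typically shows apply(axis=1) landing in the same ballpark as an explicit row loop, while a vectorized expression is far faster:

        import numpy as np
        import pandas as pd
        from timeit import timeit

        # Toy data: 10,000 rows, two numeric columns (names are arbitrary).
        df = pd.DataFrame(np.random.rand(10_000, 2), columns=["a", "b"])

        def row_func(row):
            # The per-row work: a trivial arithmetic combination of two columns.
            return row["a"] + 2 * row["b"]

        def with_apply():
            return df.apply(row_func, axis=1)

        def with_loop():
            # Explicit Python loop over rows, collecting results into a Series.
            return pd.Series([row_func(row) for _, row in df.iterrows()],
                             index=df.index)

        def vectorized():
            return df["a"] + 2 * df["b"]

        print("apply:     ", timeit(with_apply, number=5))
        print("iterrows:  ", timeit(with_loop, number=5))
        print("vectorized:", timeit(vectorized, number=5))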

    If we look at the source code, we can see that essentially we iterate over the indicated axis and apply the function, building the individual results (as Series) into a dictionary, and finally call the DataFrame constructor on that dictionary to return a new DataFrame:

        if axis == 0:
            series_gen = (self._ixs(i, axis=1)
                          for i in range(len(self.columns)))
            res_index = self.columns
            res_columns = self.index
        elif axis == 1:
            res_index = self.index
            res_columns = self.columns
            values = self.values
            series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                            dtype=dtype)
                          for i, (arr, name) in enumerate(zip(values,
                                                              res_index)))
        else:  # pragma : no cover
            raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))
    
        i = None
        keys = []
        results = {}
        if ignore_failures:
            successes = []
            for i, v in enumerate(series_gen):
                try:
                    results[i] = func(v)
                    keys.append(v.name)
                    successes.append(i)
                except Exception:
                    pass
            # so will work with MultiIndex
            if len(successes) < len(res_index):
                res_index = res_index.take(successes)
        else:
            try:
                for i, v in enumerate(series_gen):
                    results[i] = func(v)
                    keys.append(v.name)
            except Exception as e:
                if hasattr(e, 'args'):
                    # make sure i is defined
                    if i is not None:
                        k = res_index[i]
                        e.args = e.args + ('occurred at index %s' %
                                           pprint_thing(k), )
                raise
    
        if len(results) > 0 and is_sequence(results[0]):
            if not isinstance(results[0], Series):
                index = res_columns
            else:
                index = None
    
            result = self._constructor(data=results, index=index)
            result.columns = res_index
    
            if axis == 1:
                result = result.T
            result = result._convert(datetime=True, timedelta=True, copy=False)
    
        else:
    
            result = Series(results)
            result.index = res_index
    
        return result
    

    Specifically:

        for i, v in enumerate(series_gen):
            results[i] = func(v)
            keys.append(v.name)
    

    Where series_gen was constructed based on the requested axis.
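
    Stripped of the error handling and dtype conversion, that boils down to something like the following hand-rolled version for axis=1 (a simplified sketch that only covers functions returning one scalar per row; naive_apply_axis1 is a name made up for this illustration):

        import pandas as pd

        def naive_apply_axis1(df, func):
            # Imitates the quoted loop: walk the rows as Series, store each
            # result in a dict keyed by position, then let the Series
            # constructor reassemble the output and reattach the index.
            results = {}
            for i, (_, row) in enumerate(df.iterrows()):
                results[i] = func(row)
            out = pd.Series(results)
            out.index = df.index
            return out

        # Behaves like df.apply(func, axis=1) for scalar-returning functions:
        # naive_apply_axis1(df, lambda r: r["a"] + r["b"])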

    To get more performance out of a function, you can follow the advice in the pandas documentation on enhancing performance.

    Essentially, your options are:

    1. Write a C extension
    2. Use numba (a JIT compiler)
    3. Use pandas.eval to squeeze performance out of large DataFrames (rough sketches of options 2 and 3 follow below)
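
    As a rough illustration of options 2 and 3 (a sketch only: the data, the weighted_sum function and its coefficients are invented for the example, and numba is an optional dependency you would need to install):

        import numpy as np
        import pandas as pd
        import numba

        df = pd.DataFrame(np.random.rand(100_000, 2), columns=["a", "b"])

        # Option 2: JIT-compile the hot loop with numba and run it on the raw
        # NumPy arrays instead of calling a Python function per row.
        @numba.njit
        def weighted_sum(a, b):
            out = np.empty(a.shape[0])
            for i in range(a.shape[0]):
                out[i] = 2.0 * a[i] + b[i]
            return out

        df["numba_result"] = weighted_sum(df["a"].to_numpy(), df["b"].to_numpy())

        # Option 3: pandas.eval evaluates the whole expression at once (using
        # numexpr when available), avoiding Python-level intermediates on
        # large frames.
        df["eval_result"] = pd.eval("2.0 * a + b",
                                    local_dict={"a": df["a"], "b": df["b"]})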