Converting a series of ints to strings - Why is apply much faster than astype?

前端 未结 2 456
借酒劲吻你
借酒劲吻你 2020-12-05 00:31

I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:

相关标签:
2条回答
  • 2020-12-05 01:02

    Let's begin with a bit of general advise: If you're interested in finding the bottlenecks of Python code you can use a profiler to find the functions/parts that eat up most of the time. In this case I use a line-profiler because you can actually see the implementation and the time spent on each line.

    However, these tools don't work with C or Cython by default. Given that CPython (that's the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython there will be a limit how far I'll get with profiling.

    Actually: one probably could extend profiling to the Cython code and probably also the C code by recompiling it with debug symbols and tracing, however it's not an easy task to compile these libraries so I won't do that (but if someone likes to do that the Cython documentation includes a page about profiling Cython code).

    But let's see how far I can get:

    Line-Profiling Python code

    I'm going to use line-profiler and a Jupyter Notebook here:

    %load_ext line_profiler
    
    import numpy as np
    import pandas as pd
    
    x = pd.Series(np.random.randint(0, 100, 100000))
    

    Profiling x.astype

    %lprun -f x.astype x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        87                                                   @wraps(func)
        88                                                   def wrapper(*args, **kwargs):
        89         1           12     12.0      0.0              old_arg_value = kwargs.pop(old_arg_name, None)
        90         1            5      5.0      0.0              if old_arg_value is not None:
        91                                                           if mapping is not None:
       ...
       118         1       663354 663354.0    100.0              return func(*args, **kwargs)
    

    So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:

    %lprun -f x.astype.__wrapped__ x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      3896                                               @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
      3897                                                                mapping={True: 'raise', False: 'ignore'})
      3898                                               def astype(self, dtype, copy=True, errors='raise', **kwargs):
      3899                                                   """
      ...
      3975                                                   """
      3976         1           28     28.0      0.0          if is_dict_like(dtype):
      3977                                                       if self.ndim == 1:  # i.e. Series
      ...
      4001                                           
      4002                                                   # else, only a single dtype is given
      4003         1           14     14.0      0.0          new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
      4004         1       685863 685863.0     99.9                                       **kwargs)
      4005         1          340    340.0      0.0          return self._constructor(new_data).__finalize__(self)
    

    Source

    Again one line is the bottleneck so let's check the _data.astype method:

    %lprun -f x._data.astype x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      3461                                               def astype(self, dtype, **kwargs):
      3462         1       695866 695866.0    100.0          return self.apply('astype', dtype=dtype, **kwargs)
    

    Okay, another delegate, let's see what _data.apply does:

    %lprun -f x._data.apply x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      3251                                               def apply(self, f, axes=None, filter=None, do_integrity_check=False,
      3252                                                         consolidate=True, **kwargs):
      3253                                                   """
      ...
      3271                                                   """
      3272                                           
      3273         1           12     12.0      0.0          result_blocks = []
      ...
      3309                                           
      3310         1           10     10.0      0.0          aligned_args = dict((k, kwargs[k])
      3311         1           29     29.0      0.0                              for k in align_keys
      3312                                                                       if hasattr(kwargs[k], 'reindex_axis'))
      3313                                           
      3314         2           28     14.0      0.0          for b in self.blocks:
      ...
      3329         1       674974 674974.0    100.0              applied = getattr(b, f)(**kwargs)
      3330         1           30     30.0      0.0              result_blocks = _extend_blocks(applied, result_blocks)
      3331                                           
      3332         1           10     10.0      0.0          if len(result_blocks) == 0:
      3333                                                       return self.make_empty(axes or self.axes)
      3334         1           10     10.0      0.0          bm = self.__class__(result_blocks, axes or self.axes,
      3335         1           76     76.0      0.0                              do_integrity_check=do_integrity_check)
      3336         1           13     13.0      0.0          bm._consolidate_inplace()
      3337         1            7      7.0      0.0          return bm
    

    Source

    And again ... one function call is taking all the time, this time it's x._data.blocks[0].astype:

    %lprun -f x._data.blocks[0].astype x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       542                                               def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
       543         1           18     18.0      0.0          return self._astype(dtype, copy=copy, errors=errors, values=values,
       544         1       671092 671092.0    100.0                              **kwargs)
    

    .. which is another delegate...

    %lprun -f x._data.blocks[0]._astype x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       546                                               def _astype(self, dtype, copy=False, errors='raise', values=None,
       547                                                           klass=None, mgr=None, **kwargs):
       548                                                   """
       ...
       557                                                   """
       558         1           11     11.0      0.0          errors_legal_values = ('raise', 'ignore')
       559                                           
       560         1            8      8.0      0.0          if errors not in errors_legal_values:
       561                                                       invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
       562                                                                      "Supplied value is '{}'".format(
       563                                                                          list(errors_legal_values), errors))
       564                                                       raise ValueError(invalid_arg)
       565                                           
       566         1           23     23.0      0.0          if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
       567                                                       msg = ("Expected an instance of {}, but got the class instead. "
       568                                                              "Try instantiating 'dtype'.".format(dtype.__name__))
       569                                                       raise TypeError(msg)
       570                                           
       571                                                   # may need to convert to categorical
       572                                                   # this is only called for non-categoricals
       573         1           72     72.0      0.0          if self.is_categorical_astype(dtype):
       ...
       595                                           
       596                                                   # astype processing
       597         1           16     16.0      0.0          dtype = np.dtype(dtype)
       598         1           19     19.0      0.0          if self.dtype == dtype:
       ...
       603         1            8      8.0      0.0          if klass is None:
       604         1           13     13.0      0.0              if dtype == np.object_:
       605                                                           klass = ObjectBlock
       606         1            6      6.0      0.0          try:
       607                                                       # force the copy here
       608         1            7      7.0      0.0              if values is None:
       609                                           
       610         1            8      8.0      0.0                  if issubclass(dtype.type,
       611         1           14     14.0      0.0                                (compat.text_type, compat.string_types)):
       612                                           
       613                                                               # use native type formatting for datetime/tz/timedelta
       614         1           15     15.0      0.0                      if self.is_datelike:
       615                                                                   values = self.to_native_types()
       616                                           
       617                                                               # astype formatting
       618                                                               else:
       619         1            8      8.0      0.0                          values = self.values
       620                                           
       621                                                           else:
       622                                                               values = self.get_values(dtype=dtype)
       623                                           
       624                                                           # _astype_nansafe works fine with 1-d only
       625         1       665777 665777.0     99.9                  values = astype_nansafe(values.ravel(), dtype, copy=True)
       626         1           32     32.0      0.0                  values = values.reshape(self.shape)
       627                                           
       628         1           17     17.0      0.0              newb = make_block(values, placement=self.mgr_locs, dtype=dtype,
       629         1          269    269.0      0.0                                klass=klass)
       630                                                   except:
       631                                                       if errors == 'raise':
       632                                                           raise
       633                                                       newb = self.copy() if copy else self
       634                                           
       635         1            8      8.0      0.0          if newb.is_numeric and self.is_numeric:
       ...
       642         1            6      6.0      0.0          return newb
    

    Source

    ... okay, still not there. Let's check out astype_nansafe:

    %lprun -f pd.core.internals.astype_nansafe x.astype(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       640                                           def astype_nansafe(arr, dtype, copy=True):
       641                                               """ return a view if copy is False, but
       642                                                   need to be very careful as the result shape could change! """
       643         1           13     13.0      0.0      if not isinstance(dtype, np.dtype):
       644                                                   dtype = pandas_dtype(dtype)
       645                                           
       646         1            8      8.0      0.0      if issubclass(dtype.type, text_type):
       647                                                   # in Py3 that's str, in Py2 that's unicode
       648         1       663317 663317.0    100.0          return lib.astype_unicode(arr.ravel()).reshape(arr.shape)
       ...
    

    Source

    Again one it's one line that takes 100%, so I'll go one function further:

    %lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)
    
    UserWarning: Could not extract a code object for the object <built-in function astype_unicode>
    

    Okay, we found a built-in function, that means it's a C function. In this case it's a Cython function. But it means we cannot dig deeper with line-profiler. So I'll stop here for now.

    Profiling x.apply

    %lprun -f x.apply x.apply(str)
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      2426                                               def apply(self, func, convert_dtype=True, args=(), **kwds):
      2427                                                   """
      ...
      2523                                                   """
      2524         1           84     84.0      0.0          if len(self) == 0:
      2525                                                       return self._constructor(dtype=self.dtype,
      2526                                                                                index=self.index).__finalize__(self)
      2527                                           
      2528                                                   # dispatch to agg
      2529         1           11     11.0      0.0          if isinstance(func, (list, dict)):
      2530                                                       return self.aggregate(func, *args, **kwds)
      2531                                           
      2532                                                   # if we are a string, try to dispatch
      2533         1           12     12.0      0.0          if isinstance(func, compat.string_types):
      2534                                                       return self._try_aggregate_string_function(func, *args, **kwds)
      2535                                           
      2536                                                   # handle ufuncs and lambdas
      2537         1            7      7.0      0.0          if kwds or args and not isinstance(func, np.ufunc):
      2538                                                       f = lambda x: func(x, *args, **kwds)
      2539                                                   else:
      2540         1            6      6.0      0.0              f = func
      2541                                           
      2542         1          154    154.0      0.1          with np.errstate(all='ignore'):
      2543         1           11     11.0      0.0              if isinstance(f, np.ufunc):
      2544                                                           return f(self)
      2545                                           
      2546                                                       # row-wise access
      2547         1          188    188.0      0.1              if is_extension_type(self.dtype):
      2548                                                           mapped = self._values.map(f)
      2549                                                       else:
      2550         1         6238   6238.0      3.3                  values = self.asobject
      2551         1       181910 181910.0     95.5                  mapped = lib.map_infer(values, f, convert=convert_dtype)
      2552                                           
      2553         1           28     28.0      0.0          if len(mapped) and isinstance(mapped[0], Series):
      2554                                                       from pandas.core.frame import DataFrame
      2555                                                       return DataFrame(mapped.tolist(), index=self.index)
      2556                                                   else:
      2557         1           19     19.0      0.0              return self._constructor(mapped,
      2558         1         1870   1870.0      1.0                                       index=self.index).__finalize__(self)
    

    Source

    Again it's one function that takes most of the time: lib.map_infer ...

    %lprun -f pd.core.series.lib.map_infer x.apply(str)
    
    Could not extract a code object for the object <built-in function map_infer>
    

    Okay, that's another Cython function.

    This time there's another (although less significant) contributor with ~3%: values = self.asobject. But I'll ignore this for now, because we're interested in the major contributors.

    Going into C/Cython

    The functions called by astype

    This is the astype_unicode function:

    cpdef ndarray[object] astype_unicode(ndarray arr):
        cdef:
            Py_ssize_t i, n = arr.size
            ndarray[object] result = np.empty(n, dtype=object)
    
        for i in range(n):
            # we can use the unsafe version because we know `result` is mutable
            # since it was created from `np.empty`
            util.set_value_at_unsafe(result, i, unicode(arr[i]))
    
        return result
    

    Source

    This function uses this helper:

    cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):
        cdef:
            Py_ssize_t i, sz
        if is_float_object(loc):
            casted = int(loc)
            if casted == loc:
                loc = casted
        i = <Py_ssize_t> loc
        sz = cnp.PyArray_SIZE(arr)
    
        if i < 0:
            i += sz
        elif i >= sz:
            raise IndexError('index out of bounds')
    
        assign_value_1d(arr, i, value)
    

    Source

    Which itself uses this C function:

    PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,
                                      PyObject* v) {
        npy_intp i = (npy_intp)_i;
        char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
        return PyArray_DESCR(ap)->f->setitem(v, item, ap);
    }
    

    Source

    Functions called by apply

    This is the implementation of the map_infer function:

    def map_infer(ndarray arr, object f, bint convert=1):
        cdef:
            Py_ssize_t i, n
            ndarray[object] result
            object val
    
        n = len(arr)
        result = np.empty(n, dtype=object)
        for i in range(n):
            val = f(util.get_value_at(arr, i))
    
            # unbox 0-dim arrays, GH #690
            if is_array(val) and PyArray_NDIM(val) == 0:
                # is there a faster way to unbox?
                val = val.item()
    
            result[i] = val
    
        if convert:
            return maybe_convert_objects(result,
                                         try_float=0,
                                         convert_datetime=0,
                                         convert_timedelta=0)
    
        return result
    

    Source

    With this helper:

    cdef inline object get_value_at(ndarray arr, object loc):
        cdef:
            Py_ssize_t i, sz
            int casted
    
        if is_float_object(loc):
            casted = int(loc)
            if casted == loc:
                loc = casted
        i = <Py_ssize_t> loc
        sz = cnp.PyArray_SIZE(arr)
    
        if i < 0 and sz > 0:
            i += sz
        elif i >= sz or sz == 0:
            raise IndexError('index out of bounds')
    
        return get_value_1d(arr, i)
    

    Source

    Which uses this C function:

    PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
        char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
        return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap);
    }
    

    Source

    Some thoughts on the Cython code

    There are some differences between the Cython codes that are called eventually.

    The one taken by astype uses unicode while the apply path uses the function passed in. Let's see if that makes a difference (again IPython/Jupyter makes it very easy to compile Cython code yourself):

    %load_ext cython
    
    %%cython
    
    import numpy as np
    cimport numpy as np
    
    cpdef object func_called_by_astype(np.ndarray arr):
        cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
        for i in range(arr.size):
            ret[i] = unicode(arr[i])
        return ret
    
    cpdef object func_called_by_apply(np.ndarray arr, object f):
        cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
        for i in range(arr.size):
            ret[i] = f(arr[i])
        return ret
    

    Timing:

    import numpy as np
    
    arr = np.random.randint(0, 10000, 1000000)
    %timeit func_called_by_astype(arr)
    514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit func_called_by_apply(arr, str)
    632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Okay, there is a difference but it's wrong, it would actually indicate that apply would be slightly slower.

    But remember the asobject call that I mentioned earlier in the apply function? Could that be the reason? Let's see:

    import numpy as np
    
    arr = np.random.randint(0, 10000, 1000000)
    %timeit func_called_by_astype(arr)
    557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit func_called_by_apply(arr.astype(object), str)
    317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str is a Python function and these are generally much faster if you already have Python objects and NumPy (or Pandas) don't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array is of dtype object).

    However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:

    val = f(util.get_value_at(arr, i))
    if is_array(val) and PyArray_NDIM(val) == 0:
        val = val.item()
    result[i] = val
    

    part of the map_infer function is faster than:

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, unicode(arr[i]))
    

    which is called by the astype(str) path. The comments of the first function seem to indicate that the writer of map_infer actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?" while the other one maybe was written without special care about performance. But that's just a guess.

    Also on my computer I'm actually quite close to the performance of the x.astype(str) and x.apply(str) already:

    import numpy as np
    
    arr = np.random.randint(0, 100, 1000000)
    s = pd.Series(arr)
    %timeit s.astype(str)
    535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit func_called_by_astype(arr)
    547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    
    %timeit s.apply(str)
    216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit func_called_by_apply(arr.astype(object), str)
    272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Note that I also checked some other variants that return a different result:

    %timeit s.values.astype(str)  # array of strings
    407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit list(map(str, s.values.tolist()))  # list of strings
    184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Interestingly the Python loop with list and map seems to be the fastest on my computer.

    I actually made a small benchmark including plot:

    import pandas as pd
    import simple_benchmark
    
    def Series_astype(series):
        return series.astype(str)
    
    def Series_apply(series):
        return series.apply(str)
    
    def Series_tolist_map(series):
        return list(map(str, series.values.tolist()))
    
    def Series_values_astype(series):
        return series.values.astype(str)
    
    
    arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)}
    b = simple_benchmark.benchmark(
        [Series_astype, Series_apply, Series_tolist_map, Series_values_astype],
        arguments,
        argument_name='Series size'
    )
    
    %matplotlib notebook
    b.plot()
    

    Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.

    The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:

    Versions
    --------
    Python 3.6.5
    NumPy 1.14.2
    Pandas 0.22.0
    
    0 讨论(0)
  • 2020-12-05 01:08

    Performance

    It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).

    import pandas as pd, numpy as np
    
    ### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
    
    x = pd.Series(np.random.randint(0, 100, 100000))
    
    %timeit x.apply(str)          # 42ms   (1)
    %timeit x.map(str)            # 42ms   (2)
    %timeit x.astype(str)         # 559ms  (3)
    %timeit [str(i) for i in x]   # 566ms  (4)
    %timeit list(map(str, x))     # 536ms  (5)
    %timeit x.values.astype(str)  # 25ms   (6)
    

    Points worth noting:

    1. (5) is marginally quicker than (3) / (4), which we expect as more work is moved into C [assuming no lambda function is used].
    2. (6) is by far the fastest.
    3. (1) / (2) are similar.
    4. (3) / (4) are similar.

    Why is x.map / x.apply fast?

    This appears to be because it uses fast compiled Cython code:

    cpdef ndarray[object] astype_str(ndarray arr):
        cdef:
            Py_ssize_t i, n = arr.size
            ndarray[object] result = np.empty(n, dtype=object)
    
        for i in range(n):
            # we can use the unsafe version because we know `result` is mutable
            # since it was created from `np.empty`
            util.set_value_at_unsafe(result, i, str(arr[i]))
    
        return result
    

    Why is x.astype(str) slow?

    Pandas applies str to each item in the series, not using the above Cython.

    Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).

    Why is x.values.astype(str) so fast?

    Numpy does not apply a function on each element of the array. One description of this I found:

    If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str.

    There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.

    0 讨论(0)
提交回复
热议问题