I have a pandas.Series
containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series
object:
Let's begin with a bit of general advise: If you're interested in finding the bottlenecks of Python code you can use a profiler to find the functions/parts that eat up most of the time. In this case I use a line-profiler because you can actually see the implementation and the time spent on each line.
However, these tools don't work with C or Cython by default. Given that CPython (that's the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython there will be a limit how far I'll get with profiling.
Actually: one probably could extend profiling to the Cython code and probably also the C code by recompiling it with debug symbols and tracing, however it's not an easy task to compile these libraries so I won't do that (but if someone likes to do that the Cython documentation includes a page about profiling Cython code).
But let's see how far I can get:
I'm going to use line-profiler and a Jupyter Notebook here:
%load_ext line_profiler
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 100000))
x.astype
%lprun -f x.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
87 @wraps(func)
88 def wrapper(*args, **kwargs):
89 1 12 12.0 0.0 old_arg_value = kwargs.pop(old_arg_name, None)
90 1 5 5.0 0.0 if old_arg_value is not None:
91 if mapping is not None:
...
118 1 663354 663354.0 100.0 return func(*args, **kwargs)
So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:
%lprun -f x.astype.__wrapped__ x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3896 @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
3897 mapping={True: 'raise', False: 'ignore'})
3898 def astype(self, dtype, copy=True, errors='raise', **kwargs):
3899 """
...
3975 """
3976 1 28 28.0 0.0 if is_dict_like(dtype):
3977 if self.ndim == 1: # i.e. Series
...
4001
4002 # else, only a single dtype is given
4003 1 14 14.0 0.0 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
4004 1 685863 685863.0 99.9 **kwargs)
4005 1 340 340.0 0.0 return self._constructor(new_data).__finalize__(self)
Source
Again one line is the bottleneck so let's check the _data.astype
method:
%lprun -f x._data.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3461 def astype(self, dtype, **kwargs):
3462 1 695866 695866.0 100.0 return self.apply('astype', dtype=dtype, **kwargs)
Okay, another delegate, let's see what _data.apply
does:
%lprun -f x._data.apply x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3251 def apply(self, f, axes=None, filter=None, do_integrity_check=False,
3252 consolidate=True, **kwargs):
3253 """
...
3271 """
3272
3273 1 12 12.0 0.0 result_blocks = []
...
3309
3310 1 10 10.0 0.0 aligned_args = dict((k, kwargs[k])
3311 1 29 29.0 0.0 for k in align_keys
3312 if hasattr(kwargs[k], 'reindex_axis'))
3313
3314 2 28 14.0 0.0 for b in self.blocks:
...
3329 1 674974 674974.0 100.0 applied = getattr(b, f)(**kwargs)
3330 1 30 30.0 0.0 result_blocks = _extend_blocks(applied, result_blocks)
3331
3332 1 10 10.0 0.0 if len(result_blocks) == 0:
3333 return self.make_empty(axes or self.axes)
3334 1 10 10.0 0.0 bm = self.__class__(result_blocks, axes or self.axes,
3335 1 76 76.0 0.0 do_integrity_check=do_integrity_check)
3336 1 13 13.0 0.0 bm._consolidate_inplace()
3337 1 7 7.0 0.0 return bm
Source
And again ... one function call is taking all the time, this time it's x._data.blocks[0].astype
:
%lprun -f x._data.blocks[0].astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
543 1 18 18.0 0.0 return self._astype(dtype, copy=copy, errors=errors, values=values,
544 1 671092 671092.0 100.0 **kwargs)
.. which is another delegate...
%lprun -f x._data.blocks[0]._astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
546 def _astype(self, dtype, copy=False, errors='raise', values=None,
547 klass=None, mgr=None, **kwargs):
548 """
...
557 """
558 1 11 11.0 0.0 errors_legal_values = ('raise', 'ignore')
559
560 1 8 8.0 0.0 if errors not in errors_legal_values:
561 invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
562 "Supplied value is '{}'".format(
563 list(errors_legal_values), errors))
564 raise ValueError(invalid_arg)
565
566 1 23 23.0 0.0 if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
567 msg = ("Expected an instance of {}, but got the class instead. "
568 "Try instantiating 'dtype'.".format(dtype.__name__))
569 raise TypeError(msg)
570
571 # may need to convert to categorical
572 # this is only called for non-categoricals
573 1 72 72.0 0.0 if self.is_categorical_astype(dtype):
...
595
596 # astype processing
597 1 16 16.0 0.0 dtype = np.dtype(dtype)
598 1 19 19.0 0.0 if self.dtype == dtype:
...
603 1 8 8.0 0.0 if klass is None:
604 1 13 13.0 0.0 if dtype == np.object_:
605 klass = ObjectBlock
606 1 6 6.0 0.0 try:
607 # force the copy here
608 1 7 7.0 0.0 if values is None:
609
610 1 8 8.0 0.0 if issubclass(dtype.type,
611 1 14 14.0 0.0 (compat.text_type, compat.string_types)):
612
613 # use native type formatting for datetime/tz/timedelta
614 1 15 15.0 0.0 if self.is_datelike:
615 values = self.to_native_types()
616
617 # astype formatting
618 else:
619 1 8 8.0 0.0 values = self.values
620
621 else:
622 values = self.get_values(dtype=dtype)
623
624 # _astype_nansafe works fine with 1-d only
625 1 665777 665777.0 99.9 values = astype_nansafe(values.ravel(), dtype, copy=True)
626 1 32 32.0 0.0 values = values.reshape(self.shape)
627
628 1 17 17.0 0.0 newb = make_block(values, placement=self.mgr_locs, dtype=dtype,
629 1 269 269.0 0.0 klass=klass)
630 except:
631 if errors == 'raise':
632 raise
633 newb = self.copy() if copy else self
634
635 1 8 8.0 0.0 if newb.is_numeric and self.is_numeric:
...
642 1 6 6.0 0.0 return newb
Source
... okay, still not there. Let's check out astype_nansafe
:
%lprun -f pd.core.internals.astype_nansafe x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
640 def astype_nansafe(arr, dtype, copy=True):
641 """ return a view if copy is False, but
642 need to be very careful as the result shape could change! """
643 1 13 13.0 0.0 if not isinstance(dtype, np.dtype):
644 dtype = pandas_dtype(dtype)
645
646 1 8 8.0 0.0 if issubclass(dtype.type, text_type):
647 # in Py3 that's str, in Py2 that's unicode
648 1 663317 663317.0 100.0 return lib.astype_unicode(arr.ravel()).reshape(arr.shape)
...
Source
Again one it's one line that takes 100%, so I'll go one function further:
%lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)
UserWarning: Could not extract a code object for the object <built-in function astype_unicode>
Okay, we found a built-in function
, that means it's a C function. In this case it's a Cython function. But it means we cannot dig deeper with line-profiler. So I'll stop here for now.
x.apply
%lprun -f x.apply x.apply(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2426 def apply(self, func, convert_dtype=True, args=(), **kwds):
2427 """
...
2523 """
2524 1 84 84.0 0.0 if len(self) == 0:
2525 return self._constructor(dtype=self.dtype,
2526 index=self.index).__finalize__(self)
2527
2528 # dispatch to agg
2529 1 11 11.0 0.0 if isinstance(func, (list, dict)):
2530 return self.aggregate(func, *args, **kwds)
2531
2532 # if we are a string, try to dispatch
2533 1 12 12.0 0.0 if isinstance(func, compat.string_types):
2534 return self._try_aggregate_string_function(func, *args, **kwds)
2535
2536 # handle ufuncs and lambdas
2537 1 7 7.0 0.0 if kwds or args and not isinstance(func, np.ufunc):
2538 f = lambda x: func(x, *args, **kwds)
2539 else:
2540 1 6 6.0 0.0 f = func
2541
2542 1 154 154.0 0.1 with np.errstate(all='ignore'):
2543 1 11 11.0 0.0 if isinstance(f, np.ufunc):
2544 return f(self)
2545
2546 # row-wise access
2547 1 188 188.0 0.1 if is_extension_type(self.dtype):
2548 mapped = self._values.map(f)
2549 else:
2550 1 6238 6238.0 3.3 values = self.asobject
2551 1 181910 181910.0 95.5 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 1 28 28.0 0.0 if len(mapped) and isinstance(mapped[0], Series):
2554 from pandas.core.frame import DataFrame
2555 return DataFrame(mapped.tolist(), index=self.index)
2556 else:
2557 1 19 19.0 0.0 return self._constructor(mapped,
2558 1 1870 1870.0 1.0 index=self.index).__finalize__(self)
Source
Again it's one function that takes most of the time: lib.map_infer
...
%lprun -f pd.core.series.lib.map_infer x.apply(str)
Could not extract a code object for the object <built-in function map_infer>
Okay, that's another Cython function.
This time there's another (although less significant) contributor with ~3%: values = self.asobject
. But I'll ignore this for now, because we're interested in the major contributors.
astype
This is the astype_unicode
function:
cpdef ndarray[object] astype_unicode(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
return result
Source
This function uses this helper:
cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):
cdef:
Py_ssize_t i, sz
if is_float_object(loc):
casted = int(loc)
if casted == loc:
loc = casted
i = <Py_ssize_t> loc
sz = cnp.PyArray_SIZE(arr)
if i < 0:
i += sz
elif i >= sz:
raise IndexError('index out of bounds')
assign_value_1d(arr, i, value)
Source
Which itself uses this C function:
PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,
PyObject* v) {
npy_intp i = (npy_intp)_i;
char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
return PyArray_DESCR(ap)->f->setitem(v, item, ap);
}
Source
apply
This is the implementation of the map_infer
function:
def map_infer(ndarray arr, object f, bint convert=1):
cdef:
Py_ssize_t i, n
ndarray[object] result
object val
n = len(arr)
result = np.empty(n, dtype=object)
for i in range(n):
val = f(util.get_value_at(arr, i))
# unbox 0-dim arrays, GH #690
if is_array(val) and PyArray_NDIM(val) == 0:
# is there a faster way to unbox?
val = val.item()
result[i] = val
if convert:
return maybe_convert_objects(result,
try_float=0,
convert_datetime=0,
convert_timedelta=0)
return result
Source
With this helper:
cdef inline object get_value_at(ndarray arr, object loc):
cdef:
Py_ssize_t i, sz
int casted
if is_float_object(loc):
casted = int(loc)
if casted == loc:
loc = casted
i = <Py_ssize_t> loc
sz = cnp.PyArray_SIZE(arr)
if i < 0 and sz > 0:
i += sz
elif i >= sz or sz == 0:
raise IndexError('index out of bounds')
return get_value_1d(arr, i)
Source
Which uses this C function:
PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap);
}
Source
There are some differences between the Cython codes that are called eventually.
The one taken by astype
uses unicode
while the apply
path uses the function passed in. Let's see if that makes a difference (again IPython/Jupyter makes it very easy to compile Cython code yourself):
%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef object func_called_by_astype(np.ndarray arr):
cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
for i in range(arr.size):
ret[i] = unicode(arr[i])
return ret
cpdef object func_called_by_apply(np.ndarray arr, object f):
cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
for i in range(arr.size):
ret[i] = f(arr[i])
return ret
Timing:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr, str)
632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, there is a difference but it's wrong, it would actually indicate that apply
would be slightly slower.
But remember the asobject
call that I mentioned earlier in the apply
function? Could that be the reason? Let's see:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str
is a Python function and these are generally much faster if you already have Python objects and NumPy (or Pandas) don't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array is of dtype object
).
However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:
val = f(util.get_value_at(arr, i))
if is_array(val) and PyArray_NDIM(val) == 0:
val = val.item()
result[i] = val
part of the map_infer
function is faster than:
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
which is called by the astype(str)
path. The comments of the first function seem to indicate that the writer of map_infer
actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?" while the other one maybe was written without special care about performance. But that's just a guess.
Also on my computer I'm actually quite close to the performance of the x.astype(str)
and x.apply(str)
already:
import numpy as np
arr = np.random.randint(0, 100, 1000000)
s = pd.Series(arr)
%timeit s.astype(str)
535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_astype(arr)
547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit s.apply(str)
216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that I also checked some other variants that return a different result:
%timeit s.values.astype(str) # array of strings
407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(str, s.values.tolist())) # list of strings
184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly the Python loop with list
and map
seems to be the fastest on my computer.
I actually made a small benchmark including plot:
import pandas as pd
import simple_benchmark
def Series_astype(series):
return series.astype(str)
def Series_apply(series):
return series.apply(str)
def Series_tolist_map(series):
return list(map(str, series.values.tolist()))
def Series_values_astype(series):
return series.values.astype(str)
arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)}
b = simple_benchmark.benchmark(
[Series_astype, Series_apply, Series_tolist_map, Series_values_astype],
arguments,
argument_name='Series size'
)
%matplotlib notebook
b.plot()
Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.
The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:
Versions
--------
Python 3.6.5
NumPy 1.14.2
Pandas 0.22.0
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x))
appears to be slower than x.apply(str)
.
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str) # 42ms (1)
%timeit x.map(str) # 42ms (2)
%timeit x.astype(str) # 559ms (3)
%timeit [str(i) for i in x] # 566ms (4)
%timeit list(map(str, x)) # 536ms (5)
%timeit x.values.astype(str) # 25ms (6)
Points worth noting:
lambda
function is used].Why is x.map / x.apply fast?
This appears to be because it uses fast compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, str(arr[i]))
return result
Why is x.astype(str) slow?
Pandas applies str
to each item in the series, not using the above Cython.
Hence performance is comparable to [str(i) for i in x]
/ list(map(str, x))
.
Why is x.values.astype(str) so fast?
Numpy does not apply a function on each element of the array. One description of this I found:
If you did
s.values.astype(str)
what you get back is an object holdingint
. This isnumpy
doing the conversion, whereas pandas iterates over each item and callsstr(item)
on it. So if you dos.astype(str)
you have an object holdingstr
.
There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.