I'm getting unwanted behaviour from np.vectorize: it changes the datatype of the argument going into the original function. This follows on from my original question.
Just as in the original question, I can "solve" the problem by forcing the incoming argument to be a pandas datetime object, by adding dt = pd.to_datetime(dt) before the first if-statement of the function.
To be honest, this feels like patching up something that's broken and should not be used. I'll just use .apply instead and take the performance hit. Anyone who feels there's a better solution is very much invited to share :)
I think @rpanai's answer on the original post is still the best. Here are my tests:
def qualifies(dt, excluded_months=[]):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months=[]):
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True
df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})
apply method:
%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
conversion method:
%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))
389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vectorized code:
%%timeit
df['qualifies2'] = np.logical_not(
    (df['date'].dt.day < 5).values
    | ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values
    | (df['date'].dt.month.isin([3, 8])).values
)
4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
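Before trusting the speed-up, it is worth checking that the column-wise expression really gives the same answers as the row-wise function. A minimal sketch of such a check follows; `qualifies_columnwise` is a hypothetical helper name introduced here, and the 500-row frame is an arbitrary choice.

```python
import pandas as pd


def qualifies(dt, excluded_months=()):
    """Row-wise rule, same logic as the function in the post."""
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


def qualifies_columnwise(dates, excluded_months=()):
    """Hypothetical helper: the same three tests applied to a whole Series."""
    too_early = dates.dt.day < 5
    too_late = (dates + pd.tseries.offsets.MonthBegin(1) - dates).dt.days < 5
    excluded = dates.dt.month.isin(list(excluded_months))
    return ~(too_early | too_late | excluded)


dates = pd.Series(pd.date_range("2020-01-01", freq="7D", periods=500))
row_wise = dates.apply(lambda x: qualifies(x, [3, 8]))
column_wise = qualifies_columnwise(dates, [3, 8])

# True if both approaches agree on every row
print((row_wise == column_wise).all())
```

If the two ever disagree, the row-wise version is the easier one to debug, since it can be called on a single Timestamp.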
If you use np.vectorize, it's best to specify otypes. In this case, the error is caused by the trial calculation that vectorize performs when otypes is not specified. An alternative is to pass the Series as an object-dtype array.
np.vectorize has a performance disclaimer. np.frompyfunc may be faster, as may a plain list comprehension.
Let's define a simpler function - one that displays the type of the argument:
In [31]: def foo(dt, excluded_months=[]):
    ...:     print(dt, type(dt))
    ...:     return True
And a smaller dataframe:
In [32]: df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})
In [33]: df
Out[33]:
date
0 2020-01-01
1 2020-01-08
2 2020-01-15
3 2020-01-22
4 2020-01-29
Testing vectorize. (The vectorize docs say that using the excluded parameter degrades performance, so I'm using a lambda, as was done with apply):
In [34]: np.vectorize(lambda x:foo(x,[3,8]))(df['date'])
2020-01-01T00:00:00.000000000 <class 'numpy.datetime64'>
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[34]: array([ True, True, True, True, True])
That first line is the datetime64 that gives problems. The other lines are the original pandas objects. If I specify otypes, that problem goes away:
In [35]: np.vectorize(lambda x:foo(x,[3,8]), otypes=['bool'])(df['date'])
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[35]: array([ True, True, True, True, True])
With apply:
In [36]: df['date'].apply(lambda x: foo(x, [3, 8]))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[36]:
0 True
1 True
2 True
3 True
4 True
Name: date, dtype: bool
A datetime64 dtype is produced by wrapping the Series in np.array.
In [37]: np.array(df['date'])
Out[37]:
array(['2020-01-01T00:00:00.000000000', '2020-01-08T00:00:00.000000000',
'2020-01-15T00:00:00.000000000', '2020-01-22T00:00:00.000000000',
'2020-01-29T00:00:00.000000000'], dtype='datetime64[ns]')
Apparently np.vectorize does this sort of wrapping when performing the initial trial calculation, but not when doing the main iterations. Specifying otypes skips that trial calculation. The trial calculation has caused problems in other SO questions, though this is a more obscure case.
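The extra trial call is easy to observe without any pandas involved. The sketch below uses a hypothetical `record` function that simply logs its calls, applied to a plain integer array, and counts how often it runs with and without otypes (leaving the `cache` option at its default).

```python
import numpy as np

calls = []


def record(x):
    """Hypothetical function that remembers every argument it was called with."""
    calls.append(x)
    return x > 1


values = np.array([1, 2, 3])

# Without otypes: one trial call on the first element, then one call per element.
np.vectorize(record)(values)
calls_without_otypes = len(calls)

# With otypes: the output dtype is known up front, so there is no trial call.
calls.clear()
np.vectorize(record, otypes=[bool])(values)
calls_with_otypes = len(calls)

print(calls_without_otypes, calls_with_otypes)  # 4 3
```

The same off-by-one shows up in the printed output above: six lines for five dates when otypes is omitted.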
In the past, when I've tested np.vectorize, it has been slower than a more explicit iteration. It does have a clear performance disclaimer. It's most valuable when the function takes several inputs and needs the benefit of broadcasting. It's hard to justify when using only one argument.
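To illustrate the multi-input case where vectorize does earn its keep, here is a small sketch. The scalar function `clipped_gap` is made up for this example; the point is only that vectorize broadcasts a (3, 1) column against a length-4 row without any explicit loops.

```python
import numpy as np


def clipped_gap(a, b):
    """Hypothetical scalar function with a branch: a - b when positive, else 0."""
    return a - b if a > b else 0


gap = np.vectorize(clipped_gap, otypes=[int])

column = np.array([[1], [5], [9]])  # shape (3, 1)
row = np.array([0, 4, 8, 12])       # shape (4,)

# The two inputs broadcast to shape (3, 4); the function runs once per pair.
result = gap(column, row)
print(result.shape)
print(result)
```

This is still a Python-level loop under the hood, so it is a convenience for broadcasting rather than a speed-up.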
np.frompyfunc underlies vectorize, but returns an object dtype. Often it is 2x faster than explicit iteration on an array, though similar in speed to iteration on a list. It seems to be most useful when creating and working with a numpy array of objects. I hadn't gotten it working in this case at first; the object-dtype conversion shown further down fixes that.
The np.vectorize code is in numpy/lib/function_base.py.
If otypes is not specified, the code does:
args = [asarray(arg) for arg in args]
inputs = [arg.flat[0] for arg in args]
outputs = func(*inputs)
It makes each argument (here only one) into an array, takes the first element, and passes that to func. As Out[37] shows, that will be a datetime64 object.
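The difference between the two conversions can be checked directly on the first element. This is a small sketch assuming a three-row Series; `trial_input` and `main_input` are names chosen here for illustration, not names from the NumPy source.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range("2020-01-01", freq="7D", periods=3))

# What the trial call receives: first element of a datetime64 array.
trial_input = np.asarray(dates).flat[0]

# What an object-dtype conversion yields instead: the pandas Timestamp.
main_input = np.array(dates, dtype=object).flat[0]

print(type(trial_input), hasattr(trial_input, "day"))
print(type(main_input), hasattr(main_input, "day"))
```

The numpy scalar has no `.day` attribute, which is why a function written for Timestamps fails on the trial call only.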
To use frompyfunc, I need to convert the dtype of df['date']:
In [68]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'])
1577836800000000000 <class 'int'>
1578441600000000000 <class 'int'>
...
Without the conversion, it passes int to the function; with it, it passes the pandas time objects:
In [69]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'].astype(object))
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...
So this use of qualifies works:
In [71]: np.frompyfunc(lambda x:qualifies(x,[3,8]),1,1)(df['date'].astype(object))
Out[71]:
0 False
1 True
2 True
3 True
4 False
Name: date, dtype: object
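Note that the result has object dtype, which is usually not what you want for a boolean column. One way to handle that, sketched below, is to run frompyfunc on an object array and cast the result with astype(bool); the column name `qualifies` and the `check` variable are choices made for this example.

```python
import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    """Same rule as in the post, repeated so the block runs on its own."""
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


df = pd.DataFrame({"date": pd.date_range("2020-01-01", freq="7D", periods=5)})

# Object-dtype array of Timestamps, so the function never sees raw integers.
dates_obj = df["date"].astype(object).to_numpy()

check = np.frompyfunc(lambda x: qualifies(x, [3, 8]), 1, 1)
raw = check(dates_obj)                 # object-dtype array of Python bools
df["qualifies"] = raw.astype(bool)     # proper boolean column

print(df)
```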
For the main iteration, np.vectorize does:
ufunc = frompyfunc(_func, len(args), nout)
# Convert args to object arrays first
inputs = [array(a, copy=False, subok=True, dtype=object)
          for a in args]
outputs = ufunc(*inputs)
That explains why vectorize with otypes works: it is using frompyfunc with an object dtype input. Contrast this with Out[37]:
In [74]: np.array(df['date'], dtype=object)
Out[74]:
array([Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-08 00:00:00'),
Timestamp('2020-01-15 00:00:00'), Timestamp('2020-01-22 00:00:00'),
Timestamp('2020-01-29 00:00:00')], dtype=object)
An alternative to specifying otypes is to make sure you are passing object dtype to vectorize:
In [75]: np.vectorize(qualifies, excluded=[1])(df['date'].astype(object), [3, 8])
Out[75]: array([False, True, True, True, False])
This appears to be the fastest version:
np.frompyfunc(lambda x: qualifies(x,[3,8]),1,1)(np.array(df['date'],object))
or better yet, a plain Python iteration:
[qualifies(x,[3,8]) for x in df['date']]
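Timings vary with machine and library versions, so it may be worth measuring on your own data. The following is a rough harness under stated assumptions: 1000 rows, three repetitions per candidate, and the standard-library timeit module standing in for %%timeit. It also confirms that all three approaches return the same values.

```python
import timeit

import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    """Same rule as in the post, repeated so the block runs on its own."""
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


dates = pd.Series(pd.date_range("2020-01-01", freq="7D", periods=1000))
dates_obj = np.array(dates, dtype=object)  # Timestamps in an object array

candidates = {
    "list comprehension": lambda: [qualifies(x, [3, 8]) for x in dates],
    "frompyfunc": lambda: np.frompyfunc(
        lambda x: qualifies(x, [3, 8]), 1, 1)(dates_obj),
    "vectorize + otypes": lambda: np.vectorize(
        lambda x: qualifies(x, [3, 8]), otypes=[bool])(dates_obj),
}

# Collect each candidate's output as plain Python bools for comparison.
results = {name: [bool(v) for v in run()] for name, run in candidates.items()}

for name, run in candidates.items():
    seconds = timeit.timeit(run, number=3) / 3
    print(f"{name}: {seconds * 1e3:.1f} ms per run")
```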