Numpy vectorization messes up data type (2)

小蘑菇 2020-12-21 06:22

I'm having unwanted behaviour come out of np.vectorize: it changes the datatype of the argument going into the original function. My original question

3 answers
  • 2020-12-21 07:02

    Just as in the original question, I can "solve" the problem by forcing the incoming argument to be a pandas datetime object, by adding dt = pd.to_datetime(dt) before the first if-statement of the function.

    To be honest, this feels like patching-up something that's broken and should not be used. I'll just use .apply instead and take the performance hit. Anyone that feels there's a better solution is very much invited to share :)

  • 2020-12-21 07:16

    I think @rpanai's answer on the original post is still the best. Here are my tests:

    import numpy as np
    import pandas as pd
    
    def qualifies(dt, excluded_months = []):
        if dt.day < 5:
            return False
        if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
            return False
        if dt.month in excluded_months:
            return False
        return True
    
    def new_qualifies(dt, excluded_months = []):
        dt = pd.Timestamp(dt)
        if dt.day < 5:
            return False
        if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
            return False
        if dt.month in excluded_months:
            return False
        return True
    
    df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})
    

    apply method:

    %%timeit
    df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
    

    385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    conversion method:

    %%timeit
    df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))
    

    389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    vectorized code:

    %%timeit
    df['qualifies2'] = np.logical_not(
        (df['date'].dt.day < 5).values
        | ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values
        | (df['date'].dt.month.isin([3, 8])).values
    )
    

    4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  • 2020-12-21 07:18

    Summary

    If you use np.vectorize, it's best to specify otypes. In this case, the error is caused by the trial calculation that vectorize performs when otypes is not specified. An alternative is to pass the Series as an object dtype array.

    np.vectorize comes with a performance disclaimer. np.frompyfunc, or even a list comprehension, may be faster.

    testing vectorize

    Let's define a simpler function - one that displays the type of the argument:

    In [31]: def foo(dt, excluded_months=[]): 
        ...:     print(dt,type(dt)) 
        ...:     return True 
    

    And a smaller dataframe:

    In [32]: df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})
    In [33]: df                                                                     
    Out[33]: 
            date
    0 2020-01-01
    1 2020-01-08
    2 2020-01-15
    3 2020-01-22
    4 2020-01-29
    

    Testing vectorize. (The vectorize docs say that using the excluded parameter degrades performance, so I'm using a lambda, as was done with apply):

    In [34]: np.vectorize(lambda x:foo(x,[3,8]))(df['date'])                        
    2020-01-01T00:00:00.000000000 <class 'numpy.datetime64'>
    2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    Out[34]: array([ True,  True,  True,  True,  True])
    

    That first line is the datetime64 that causes the problem. The other lines are the original pandas objects. If I specify otypes, the problem goes away:

    In [35]: np.vectorize(lambda x:foo(x,[3,8]), otypes=['bool'])(df['date'])       
    2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    Out[35]: array([ True,  True,  True,  True,  True])
    

    With apply:

    In [36]: df['date'].apply(lambda x: foo(x, [3, 8]))                             
    2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    Out[36]: 
    0    True
    1    True
    2    True
    3    True
    4    True
    Name: date, dtype: bool
    

    A datetime64 dtype is produced by wrapping the Series in np.array.

    In [37]: np.array(df['date'])                                                   
    Out[37]: 
    array(['2020-01-01T00:00:00.000000000', '2020-01-08T00:00:00.000000000',
           '2020-01-15T00:00:00.000000000', '2020-01-22T00:00:00.000000000',
           '2020-01-29T00:00:00.000000000'], dtype='datetime64[ns]')
    

    Apparently np.vectorize does this sort of wrapping when performing the initial trial calculation, but not when doing the main iterations. Specifying otypes skips that trial calculation. The trial calculation has caused problems in other SO questions, though this is a more obscure case.
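
    A minimal sketch of how the trial call can be observed without printing, assuming a NumPy/pandas combination that behaves like the session above. The helper make_recorder is hypothetical and only logs the type name of each argument it receives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-01-01", freq="7D", periods=5)})


def make_recorder():
    """Return a function that logs the type name of every argument it gets."""
    seen = []

    def record(dt):
        seen.append(type(dt).__name__)
        return True

    return record, seen


# Without otypes: vectorize calls the function once up front to infer the
# output type, then again for every element in the main loop.
record, seen_without = make_recorder()
np.vectorize(record)(df["date"])

# With otypes: the trial call is skipped.
record, seen_with = make_recorder()
np.vectorize(record, otypes=[bool])(df["date"])

print(len(seen_without), seen_without[0])
print(len(seen_with), set(seen_with))
```

    If the behaviour matches the transcript, the first list has one more entry than the second, and that extra first entry is the datetime64 scalar.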

    In the past, when I've tested np.vectorize, it has been slower than a more explicit iteration. It does have a clear performance disclaimer. It's most valuable when the function takes several inputs and needs the benefit of broadcasting. It's hard to justify when using only one argument.
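
    To illustrate the several-inputs case, here is a small sketch with a made-up scalar function (days_short is hypothetical, not from the question) applied to two arrays of different shapes:

```python
import numpy as np


def days_short(day, minimum):
    """Hypothetical scalar helper: how far `day` falls short of `minimum`."""
    return max(minimum - day, 0)


# otypes is given explicitly, so no trial call is made.
vec = np.vectorize(days_short, otypes=[int])

days = np.array([1, 8, 29]).reshape(-1, 1)  # shape (3, 1)
minimums = np.array([5, 10])                # shape (2,)

# The two inputs broadcast against each other to shape (3, 2).
result = vec(days, minimums)
print(result)
```

    Writing the same thing with explicit loops would need a nested iteration, which is where vectorize earns its keep despite the speed penalty.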

    np.frompyfunc underlies vectorize, but returns an object dtype. Often it is 2x faster than explicit iteration on an array, though similar in speed to iteration on a list. It seems to be most useful when creating and working with a numpy array of objects. I haven't gotten it working in this case.
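
    A short sketch of the object dtype result, using a toy string array rather than the dates from the question. The cast with astype(bool) is one way to get a regular boolean array back:

```python
import numpy as np

words = np.array(["a", "bb", "ccc", "dddd"], dtype=object)

# frompyfunc(func, nin, nout): one input, one output.
is_long = np.frompyfunc(lambda s: len(s) > 2, 1, 1)

raw = is_long(words)      # object dtype array holding Python bools
flags = raw.astype(bool)  # cast to a regular boolean array

print(raw.dtype, flags.dtype)
```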

    vectorize code

    The np.vectorize code is in numpy/lib/function_base.py.

    If otypes is not specified, the code does:

            args = [asarray(arg) for arg in args]
            inputs = [arg.flat[0] for arg in args]
            outputs = func(*inputs)
    

    It makes each argument (here only one) into an array, takes the first element, and passes that to func. As Out[37] shows, that element will be a datetime64 object.
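
    The difference between the two conversions can be checked directly. This is a sketch that mimics the two code paths by hand, not a call into vectorize itself:

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range("2020-01-01", freq="7D", periods=5))

# What the trial call sees: the first element of a datetime64 array.
trial_input = np.asarray(dates).flat[0]

# What the main loop sees: the first element of an object dtype array.
loop_input = np.array(dates, dtype=object).flat[0]

# numpy.datetime64 has no .day attribute, so a check like dt.day < 5 fails on it.
print(type(trial_input).__name__, hasattr(trial_input, "day"))
print(type(loop_input).__name__, loop_input.day)
```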

    frompyfunc

    To use frompyfunc, I need to convert the dtype of df['date']. Without the conversion:

    In [68]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'])                  
    1577836800000000000 <class 'int'>
    1578441600000000000 <class 'int'>
    ...
    

    Without the conversion, the function receives int; with it, the function receives the pandas Timestamp objects:

    In [69]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'].astype(object))   
    2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    ...
    

    So this use of qualifies works:

    In [71]: np.frompyfunc(lambda x:qualifies(x,[3,8]),1,1)(df['date'].astype(object))                                                                     
    Out[71]: 
    0    False
    1     True
    2     True
    3     True
    4    False
    Name: date, dtype: object
    

    object dtype

    For the main iteration, np.vectorize does:

            ufunc = frompyfunc(_func, len(args), nout)
            # Convert args to object arrays first
            inputs = [array(a, copy=False, subok=True, dtype=object)
                      for a in args]
            outputs = ufunc(*inputs)
    

    That explains why vectorize with otypes works - it is using frompyfunc with an object dtype input. Contrast this with Out[37]:

    In [74]: np.array(df['date'], dtype=object)                                     
    Out[74]: 
    array([Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-08 00:00:00'),
           Timestamp('2020-01-15 00:00:00'), Timestamp('2020-01-22 00:00:00'),
           Timestamp('2020-01-29 00:00:00')], dtype=object)
    

    And an alternative to specifying otypes is to make sure you are passing object dtype to vectorize:

    In [75]: np.vectorize(qualifies, excluded=[1])(df['date'].astype(object), [3, 8])                                                                      
    Out[75]: array([False,  True,  True,  True, False])
    

    This appears to be the fastest version:

    np.frompyfunc(lambda x: qualifies(x,[3,8]),1,1)(np.array(df['date'],object))    
    

    or better yet, a plain Python iteration:

    [qualifies(x,[3,8]) for x in df['date']] 
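
    As a sanity check, the variants discussed above can be run side by side on the small frame. This is a sketch under the assumption that the installed NumPy and pandas behave like the session above; qualifies is restated in slightly condensed form so that the block is self-contained:

```python
import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    # Same logic as the function in the question, slightly condensed.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


df = pd.DataFrame({"date": pd.date_range("2020-01-01", freq="7D", periods=5)})
excluded = [3, 8]

via_apply = df["date"].apply(lambda x: qualifies(x, excluded)).to_numpy()
via_vectorize = np.vectorize(lambda x: qualifies(x, excluded), otypes=[bool])(df["date"])
via_frompyfunc = np.frompyfunc(lambda x: qualifies(x, excluded), 1, 1)(
    np.array(df["date"], dtype=object)
).astype(bool)
via_list = np.array([qualifies(x, excluded) for x in df["date"]])

for name, values in [
    ("apply", via_apply),
    ("vectorize", via_vectorize),
    ("frompyfunc", via_frompyfunc),
    ("list", via_list),
]:
    print(name, values)
```

    All four should agree with Out[71] and Out[75]; only their speed differs.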
    