vectorize complex slicing with pandas dataframe

灰色年华 2020-12-30 17:00

I'd like to vectorize this piece of code for speed. The purpose is to calculate a function, in this case a standard deviation, from a tuple of pair of

1 Answer
  • 2020-12-30 17:50

    Vectorized standard deviation across ranges in an array

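    import numpy as np

    def get_ranges_arr(starts, ends):
        # Taken from http://stackoverflow.com/a/37626057/3293881
        # Returns the concatenation of np.arange(s, e) for each (s, e) pair,
        # built without a Python loop. Assumes every range is non-empty.
        counts = ends - starts
        counts_csum = counts.cumsum()
        # Start with steps of +1 everywhere, then patch the first element of
        # each range so that the running sum jumps to that range's start
        id_arr = np.ones(counts_csum[-1], dtype=int)
        id_arr[0] = starts[0]
        id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
        return id_arr.cumsum()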
    
    def ranged_std(arr,starts,ends):
        # Get all indices and the IDs corresponding to same groups
        idx = get_ranges_arr(starts,ends)
        id_arr = np.repeat(np.arange(starts.size),ends-starts)
        
        # Extract relevant data
        slice_arr = arr[idx]
        
        # Reproduce the standard deviation computation for all groups at once,
        # using id_arr as the group label of each element. Standard deviation
        # only needs per-group sums and means, so np.bincount is enough.
        # This is the population standard deviation (ddof=0), matching np.std:
        # https://github.com/numpy/numpy/blob/v1.11.0/numpy/core/fromnumeric.py#L2939
        grp_counts = np.bincount(id_arr)
        mean_vals = np.bincount(id_arr,slice_arr)/grp_counts
        abs_vals = np.abs(slice_arr - mean_vals[id_arr])**2
        return np.sqrt(np.bincount(id_arr,abs_vals)/grp_counts)
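
To make the two building blocks concrete: for each group g with n_g elements, the code computes the mean mu_g = sum(x_i) / n_g and then sqrt(sum((x_i - mu_g)^2) / n_g), with both sums taken per group by np.bincount. Below is a small sketch of those steps on made-up data. The array values are invented for illustration, and the index array is built with a plain loop here only for readability (get_ranges_arr produces the same array without the loop).

```python
import numpy as np

arr = np.array([5, 1, 4, 4, 7, 2, 8, 8, 3, 6, 0, 2, 2, 5, 1, 3])  # made-up data
starts = np.array([2, 6, 11])
ends = np.array([8, 9, 15])
counts = ends - starts

# Flat positions of every element in every range: 2..7, 6..8, 11..14
idx = np.concatenate([np.arange(s, e) for s, e in zip(starts, ends)])
# Group label of each position: 0 six times, 1 three times, 2 four times
ids = np.repeat(np.arange(starts.size), counts)

vals = arr[idx]
# Without weights, bincount counts members per group;
# with weights, it sums the weights per group
grp_counts = np.bincount(ids)
means = np.bincount(ids, weights=vals) / grp_counts
# Broadcast each group's mean back to its members via means[ids]
variances = np.bincount(ids, weights=(vals - means[ids]) ** 2) / grp_counts
stds = np.sqrt(variances)
print(stds)
```

Note that overlapping ranges are fine: an element that belongs to two ranges simply appears twice in idx, once under each group label.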
    

    Sample run (verify against a loopy version)

    In [173]: arr = np.random.randint(0,9,(20))
    
    In [174]: starts = np.array([2,6,11])
    
    In [175]: ends = np.array([8,9,15])
    
    In [176]: [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    Out[176]: [1.9720265943665387, 0.81649658092772603, 0.82915619758884995]
    
    In [177]: ranged_std(arr,starts,ends)
    Out[177]: array([ 1.97202659,  0.81649658,  0.8291562 ])    
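
The sample above uses only non-empty ranges. If a range can be empty (ends[i] == starts[i]), the loop version yields nan for it, and as far as I can tell the cumsum-based index construction in get_ranges_arr assumes each range contributes at least one element. Here is a minimal sketch of a variant that tolerates empty ranges. The name ranged_std_safe and the way the index array is built are my own assumptions for illustration, not part of the linked answer, and it still assumes ends >= starts.

```python
import numpy as np

def ranged_std_safe(arr, starts, ends):
    """Hypothetical variant of ranged_std that returns NaN for empty ranges."""
    counts = ends - starts
    n = starts.size
    # Group label of every selected element; empty ranges contribute none
    ids = np.repeat(np.arange(n), counts)
    # Position of each element inside its own range, shifted by the range start
    first_pos = np.cumsum(counts) - counts
    idx = (np.arange(counts.sum())
           - np.repeat(first_pos, counts)
           + np.repeat(starts, counts))
    vals = arr[idx].astype(float)
    # minlength keeps one slot per range even when trailing ranges are empty
    grp_counts = np.bincount(ids, minlength=n)
    with np.errstate(invalid="ignore", divide="ignore"):
        means = np.bincount(ids, weights=vals, minlength=n) / grp_counts
        sq_dev = (vals - means[ids]) ** 2
        return np.sqrt(np.bincount(ids, weights=sq_dev, minlength=n) / grp_counts)

rng = np.random.default_rng(0)
arr = rng.integers(0, 9, 20)
starts = np.array([2, 6, 6, 11])
ends = np.array([8, 9, 6, 15])   # third range is empty
result = ranged_std_safe(arr, starts, ends)
print(result)
```

The empty range ends up as 0/0 in both divisions, which is why the warnings are silenced and the result for that slot is nan.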
    

    Runtime test

    Case #1: very small number of ranges (3)

    In [21]: arr = np.random.randint(0,9,(20))
    
    In [22]: starts = np.array([2,6,11])
    
    In [23]: ends = np.array([8,9,15])
    
    In [24]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    10000 loops, best of 3: 146 µs per loop
    
    In [25]: %timeit ranged_std(arr,starts,ends)
    10000 loops, best of 3: 45 µs per loop
    

    Case #2: decent number of ranges (1000)

    In [32]: arr = np.random.randint(0,9,(1010))
    
    In [33]: starts = np.random.randint(0,9,(1000))
    
    In [34]: ends = starts + np.random.randint(0,9,(1000))
    
    In [35]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    10 loops, best of 3: 47.5 ms per loop
    
    In [36]: %timeit ranged_std(arr,starts,ends)
    1000 loops, best of 3: 217 µs per loop
    

    Case #3: large number of ranges (10000)

    In [60]: arr = np.random.randint(0,9,(1010))
    
    In [61]: arr = np.random.randint(0,9,(10010))
    
    In [62]: starts = np.random.randint(0,9,(10000))
    
    In [63]: ends = starts + np.random.randint(0,9,(10000))
    
    In [64]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    1 loops, best of 3: 474 ms per loop
    
    In [65]: %timeit ranged_std(arr,starts,ends)
    100 loops, best of 3: 2.17 ms per loop
    

    The speedup is about 3x for 3 ranges and over 200x for 1000 and 10000 ranges.


    Using ranged_std to solve our case

    # Get start, stop numeric indices as needed for getting ranges array later on
    starts = asd_1.index.searchsorted(index_1)
    ends = asd_1.index.searchsorted(index_2)
    
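    # Create final dataframe output using ranged_std func.
    # .loc slicing includes the end label, hence ends+1 (this assumes every
    # date in index_2 is present in asd_1.index)
    df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)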
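
The only pandas-specific step is turning date labels into integer positions. A small sketch of that mapping follows, on a made-up series with one date deliberately removed. Using side="right" for the end labels is my suggestion rather than what the code above does: it gives an exclusive end position directly, so no +1 is needed, and it should also behave like .loc when an end date is missing from the index.

```python
import numpy as np
import pandas as pd

# Made-up daily series; 2011-02-15 is dropped to simulate a missing date
s = pd.Series(np.arange(60.0), index=pd.date_range("2011-02-01", periods=60))
s = s.drop(pd.Timestamp("2011-02-15"))

index_1 = pd.to_datetime(["2011-02-02", "2011-03-03"])
index_2 = pd.to_datetime(["2011-02-15", "2011-03-16"])

# First position whose label is >= the start label
starts = s.index.searchsorted(index_1, side="left")
# First position whose label is > the end label (already exclusive)
ends = s.index.searchsorted(index_2, side="right")

# Label-based slices (end-inclusive) vs. positional slices (end-exclusive)
matches = [s.loc[i:j].equals(s.iloc[a:b])
           for i, j, a, b in zip(index_1, index_2, starts, ends)]
print(starts, ends, matches)
```

With these positions, ranged_std(s.values, starts, ends) would be called without the +1. When every end date is present in the index, both approaches select the same elements.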
    

    Sample run for verification:

    In [17]: asd_1 = pd.Series(0.01 * np.random.randn(252), index=\
        ...:                   pd.date_range('2011-1-1', periods=252))
        ...: 
        ...: index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
        ...: index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
        ...: 
        ...: index_tot = list(zip(index_1,index_2))
        ...: aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in \
        ...:                                                index_tot], index=index_1)
        ...: 
    
    In [18]: starts = asd_1.index.searchsorted(index_1)
        ...: ends = asd_1.index.searchsorted(index_2)
        ...: df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)
        ...: 
    
    In [19]: aux_learning_std
    Out[19]: 
                       0
    2011-02-02  0.007244
    2011-04-03  0.012862
    2011-05-01  0.010155
    
    In [20]: df
    Out[20]: 
                       0
    2011-02-02  0.007244
    2011-04-03  0.012862
    2011-05-01  0.010155
    