Fast, Pythonic way of ranking chunks of 1's in a NumPy array?

忘了有多久 2021-01-05 03:29

I have a NumPy array consisting of 0's and 1's. Each sequence of 1's within the array stands for the occurrence of one event. I want to

2 answers
  • 2021-01-05 03:34

    A really cool way to do this is with the DBSCAN clustering algorithm. It might not be the most efficient option for this specific task, but it is flexible: you can require a minimum number of 1's per event, or allow a gap of a certain number of zeros within an event.

    from sklearn.cluster import DBSCAN
    import numpy as np
    
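    # Used as DBSCAN's eps: largest index distance between neighbouring 1's
    # (1 means strictly consecutive; use n + 1 to tolerate n zeros)
    max_gap = 1
    # Neighbourhood size (including the point itself) needed to seed an event
    min_samples = 1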
    
    # Get indices of every element that belongs to a certain event
    input_values = np.array([0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1])  
    positives_indices = np.where(input_values > 0)[0]
    
    # Turn the indices into a 2D array of so called 'examples'
    X = positives_indices.reshape(-1, 1)
    
    # Fit the model and get one cluster label per index in a single step
    clustering = DBSCAN(eps=max_gap, min_samples=min_samples) \
        .fit_predict(X)
    
    # Get results as a list of (index, event_id) pairs
    list(zip(positives_indices, clustering))
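
If pulling in scikit-learn feels heavy, the same two knobs (a tolerated gap of zeros and a minimum number of 1's per event) can be approximated with plain NumPy. This is only a sketch of the idea, not a replica of DBSCAN's exact semantics; the function name `label_events` and its parameters `max_zeros` and `min_ones` are made up for illustration.

```python
import numpy as np

def label_events(arr, max_zeros=0, min_ones=1):
    """Hypothetical helper: rank runs of 1's, bridging gaps of up to
    `max_zeros` zeros and dropping events with fewer than `min_ones` ones."""
    arr = np.asarray(arr)
    ones = np.flatnonzero(arr > 0)
    out = np.full(arr.shape, np.nan)
    if ones.size == 0:
        return out
    # A new event starts wherever the jump between consecutive 1-indices
    # is larger than the tolerated gap.
    new_event = np.diff(ones) > max_zeros + 1
    ids = np.r_[0, np.cumsum(new_event)]
    # Count the 1's per event, drop short events, renumber survivors from 0.
    counts = np.bincount(ids)
    keep = counts >= min_ones
    new_ids = np.cumsum(keep) - 1
    mask = keep[ids]
    out[ones[mask]] = new_ids[ids[mask]]
    return out

input_values = np.array([0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1])
print(label_events(input_values))              # every run is its own event
print(label_events(input_values, min_ones=3))  # the 2-element run is dropped
```

Only the 1's get an event id; zeros stay NaN even when they sit inside a bridged gap.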
    
  • 2021-01-05 03:55

    What you want is to label the connected runs of 1's, and luckily SciPy has a function for exactly that, scipy.ndimage.label -

    In [43]: from scipy.ndimage import label
    
    In [47]: out = label(arr)[0]
    
    In [48]: np.where(arr==0,np.nan,out-1)
    Out[48]: 
    array([nan, nan, nan,  0.,  0.,  0., nan, nan, nan,  1.,  1., nan, nan,
           nan,  2.,  2.,  2.,  2.])
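
The session above relies on `np` and `arr` already being defined. A self-contained version might look like the sketch below, assuming `arr` is the same 18-element sample array used in the DBSCAN answer (the printed output above is consistent with that, but the question text does not show it).

```python
import numpy as np
from scipy.ndimage import label

# Assumed sample input (taken from the other answer)
arr = np.array([0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1])

# label() returns (labeled array, number of features);
# background stays 0 and the runs are numbered 1..n_events
labels, n_events = label(arr)

# Shift to 0-based ranks and blank out the zeros
ranks = np.where(arr == 0, np.nan, labels - 1)
print(n_events)  # 3
print(ranks)
```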
    

    Another with some NumPy work -

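    def rank_chunks(arr):
        # Prepend False so a run starting at index 0 still has a rising edge
        m = np.r_[False,arr.astype(bool)]
        # Start index of every run of 1's (False -> True transitions)
        idx = np.flatnonzero(m[:-1] < m[1:])
        id_ar = np.zeros(len(arr),dtype=float)
        # Mark the start of every run except the first, so cumsum counts 0,1,2,...
        id_ar[idx[1:]] = 1
        out = id_ar.cumsum()
        # Zeros belong to no event
        out[arr==0] = np.nan
        return out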
    

    Another with masking + np.repeat -

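    def rank_chunks_v2(arr):
        # Pad with False on both sides so every run has a start and a stop edge
        m = np.r_[False,arr.astype(bool),False]
        # Alternating start / stop (exclusive) indices of the runs
        idx = np.flatnonzero(m[:-1] != m[1:])
        # Length of each run
        l = idx[1::2]-idx[::2]
        out = np.full(len(arr),np.nan,dtype=float)
        # Repeat each run's rank once per element of that run
        out[arr!=0] = np.repeat(np.arange(len(l)),l)
        return out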
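
To make the edge detection in rank_chunks_v2 concrete, here is a small worked sketch of its intermediate values. It assumes the input is the 18-element sample array from the DBSCAN answer; the names `edges`, `starts`, `stops` and `lengths` are just illustrative.

```python
import numpy as np

# Assumed sample input (taken from the other answer)
arr = np.array([0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1])

# Pad with False on both ends, then find every position where the mask flips
m = np.r_[False, arr.astype(bool), False]
edges = np.flatnonzero(m[:-1] != m[1:])

# Even entries are run starts, odd entries are exclusive run stops
starts, stops = edges[::2], edges[1::2]
lengths = stops - starts

print(starts.tolist())   # [3, 9, 14]
print(stops.tolist())    # [6, 11, 18]
print(lengths.tolist())  # [3, 2, 4]
```

Because of the padding, the edges always come in start/stop pairs, even when a run touches the first or last element.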
    

    Timings (tiling the given input 1,000,000 times) -

    In [153]: arr_big = np.tile(arr,1000000)
    
    In [154]: %timeit np.where(arr_big==0,np.nan,label(arr_big)[0]-1)
         ...: %timeit rank_chunks(arr_big)
         ...: %timeit rank_chunks_v2(arr_big)
    1 loop, best of 3: 312 ms per loop
    1 loop, best of 3: 263 ms per loop
    1 loop, best of 3: 229 ms per loop
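
`%timeit` only exists inside IPython. Outside of it, a comparable "best of N" measurement can be sketched with the standard `timeit` module. The helper `best_time` and the wrapper `rank_with_label` are hypothetical names, the sample array is assumed to be the one from the DBSCAN answer, and the tile count is kept small here so the snippet runs quickly; the numbers it prints will differ from the ones above.

```python
import timeit
import numpy as np
from scipy.ndimage import label

def best_time(func, *args, repeat=3, number=1):
    """Hypothetical helper: best seconds per call over `repeat` rounds."""
    times = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(times) / number

def rank_with_label(a):
    # Same expression as the first %timeit line above
    return np.where(a == 0, np.nan, label(a)[0] - 1)

# Assumed sample input (taken from the other answer)
arr = np.array([0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1])
arr_big = np.tile(arr, 1000)  # far fewer tiles than in the timings above

print(f"{best_time(rank_with_label, arr_big):.6f} s per call")
```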
    