pandas: how to find consecutive values in a Series whose differences are within a certain distance

独厮守ぢ 2021-02-15 02:44

I have a pandas Series that is composed of ints

import numpy as np
import pandas as pd

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0  1
1  2
2  3
3  5
4          


        
2 answers
  •  轻奢々 (original poster)
    2021-02-15 03:22

    Here's one approach -

    np.split(a,np.flatnonzero(np.diff(a)>d)+1)
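
To make the one-liner easier to follow, here is a minimal step-by-step sketch on the sample array from the question. The intermediate names gaps, breaks and groups are introduced only for illustration, and d = 2 is an assumed threshold.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2  # assumed threshold: largest gap allowed inside one group

gaps = np.diff(a)                      # [1 1 2 2 3 3 3 4]
# Positions where the gap exceeds d; +1 turns them into the index
# at which each new group starts.
breaks = np.flatnonzero(gaps > d) + 1  # [5 6 7 8]
groups = np.split(a, breaks)           # list of sub-arrays

print([g.tolist() for g in groups])
# [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

Each entry of breaks is the position of the first element of a new group, which is why 1 is added to the positions found in the difference array.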
    

    As a function that returns a list of lists -

    def splitme(a, d):
        # Split wherever the gap to the previous element exceeds d.
        return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))
    

    For performance, I would suggest computing the start and stop indices of each group, pairing them with zip, and slicing a plain list. This avoids np.split, which might prove to be the bottleneck -

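    def splitme_zip(a, d):
        # True marks the start of each group; the trailing True is an end sentinel.
        m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
        idx = np.flatnonzero(m)
        lst = a.tolist()
        # Consecutive entries of idx are the start and stop of each group.
        return [lst[i:j] for i, j in zip(idx[:-1], idx[1:])]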
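
As a rough illustration of what the mask and index arrays look like, the sketch below prints the start and stop bounds for the sample array. The name bounds and the choice d = 2 are assumptions made for this example only.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2  # assumed threshold for this illustration

# True where an element starts a new group. The leading True opens the
# first group; the trailing True is a sentinel one past the last element.
m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
idx = np.flatnonzero(m)  # [0 5 6 7 8 9]

# Pairing each index with its successor gives the (start, stop) slices.
bounds = list(zip(idx[:-1].tolist(), idx[1:].tolist()))
print(bounds)
# [(0, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
```

Slicing with these bounds reproduces the groups that np.split returns for the same d, without building the sub-arrays through np.split.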
    

    If you need the output as a list of arrays, skip the list conversion (the .tolist() call in the zip version, or map(list, ...) in the np.split version).
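
A possible list-of-arrays variant is sketched below. The function name splitme_arrays is hypothetical, and the np.asarray call is an added assumption so that a pandas Series (as in the question) can be passed directly.

```python
import numpy as np
import pandas as pd

def splitme_arrays(values, d):
    """Hypothetical variant of the zip approach that returns NumPy arrays."""
    a = np.asarray(values)  # assumption: accept a Series or an ndarray
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Slice the array itself instead of a list, so each group stays an ndarray.
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]

s = pd.Series([1, 2, 3, 5, 7, 10, 13, 16, 20])
parts = splitme_arrays(s, 3)
print([p.tolist() for p in parts])
# [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
```

The timings reported for the list versions do not necessarily carry over to this variant.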

    Sample runs -

    In [122]: a = np.array([1,2,3,5,7,10,13,16,20])
    
    In [123]: splitme(a,1)
    Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
    
    In [124]: splitme(a,2)
    Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
    
    In [125]: splitme(a,3)
    Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
    

    Runtime test -

    In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))
    
    In [181]: s = pd.Series(a)
    
    In [182]: d = 3
    
    In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
    10 loops, best of 3: 55.1 ms per loop
    
    In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
         ...: %timeit splitme(a,d)
         ...: %timeit splitme_zip(a,d)
    1000 loops, best of 3: 1.47 ms per loop
    100 loops, best of 3: 2.87 ms per loop
    1000 loops, best of 3: 516 µs per loop
    
    In [185]: a
    Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])
    
