Finding consecutive segments in a pandas data frame

有刺的猬 2020-11-28 04:48

I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point

2 Answers
  • 2020-11-28 05:13

    You could use np.diff() to test where a segment starts or ends and iterate over those results. It's a very simple solution, so probably not the most performant one.

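    import numpy as np

    a = np.array([3,3,3,3,3,4,4,4,4,4,1,1,1,1,4,4,12,12,12])

    # Exclusive end of each segment: one past every index where the value
    # changes, plus the array length for the final segment
    splits = np.append(np.where(np.diff(a) != 0)[0] + 1, len(a))
    positions = np.arange(1, a.size + 1)  # 1-based positions

    prev = 0
    for split in splits:
        print(positions[prev:split])
        prev = split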
    

    Results in:

    [1 2 3 4 5]
    [ 6  7  8  9 10]
    [11 12 13 14]
    [15 16]
    [17 18 19]
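
    The same idea can be written without the explicit loop bookkeeping. The following is only a sketch of an alternative, not part of the original answer: it uses np.flatnonzero and np.split on the same array, and the names boundaries and segments are chosen here for illustration.

```python
import numpy as np

a = np.array([3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 12, 12, 12])

# Positions where a new run starts (the element differs from its predecessor).
boundaries = np.flatnonzero(np.diff(a)) + 1

# Split the 1-based positions at those boundaries: one array per run.
segments = np.split(np.arange(1, a.size + 1), boundaries)

for seg in segments:
    print(seg)
```

    This should print the same five segments as the loop above, and segments can be reused afterwards as a plain list of arrays.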
    
  • 2020-11-28 05:34

    One-liner:

    df.reset_index().groupby('A')['index'].apply(np.array)
    

    Code for example:

    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: df = pd.DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])
    In [4]: df
    Out[4]:
        A
    0   3
    1   3
    2   3
    3   3
    4   4
    5   4
    6   4
    7   4
    8   1
    9   1
    10  1
    11  1
    
    In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
    Out[5]:
    A
    1    [8, 9, 10, 11]
    3      [0, 1, 2, 3]
    4      [4, 5, 6, 7]
    

    You can also directly access the information from the groupby object:

    In [1]: grp = df.groupby('A')
    
    In [2]: grp.indices
    Out[2]:
    {1: array([ 8,  9, 10, 11], dtype=int64),
     3: array([0, 1, 2, 3], dtype=int64),
     4: array([4, 5, 6, 7], dtype=int64)}
    
    In [3]: grp.indices[3]
    Out[3]: array([0, 1, 2, 3], dtype=int64)
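
    One caveat with grouping on the value alone: if the same value appears in two separate runs, both runs land in the same group. A minimal sketch of that behaviour, using a small made-up frame (the data here is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: the value 3 occurs in two separate runs.
df = pd.DataFrame({"A": [3, 3, 4, 4, 3, 3]})

# groupby("A") keys on the value only, so both runs of 3 end up together.
indices = df.groupby("A").indices
print(indices[3])
```

    The positions of the first and the last run of 3 come back as a single array, so this approach only identifies segments when each value occurs in exactly one run.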
    

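    To address the situation that DSM mentioned, where the same value recurs in a later, non-adjacent segment (rows 12 to 15 below), you could label each block of consecutive equal values with something like: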

    In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()
    
    In [2]: df
    Out[2]:
        A  block
    0   3      1
    1   3      1
    2   3      1
    3   3      1
    4   4      2
    5   4      2
    6   4      2
    7   4      2
    8   1      3
    9   1      3
    10  1      3
    11  1      3
    12  3      4
    13  3      4
    14  3      4
    15  3      4
    

    Now group by both columns and apply np.array:

    In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
    Out[77]:
    A  block
    1  3          [8, 9, 10, 11]
    3  1            [0, 1, 2, 3]
       4        [12, 13, 14, 15]
    4  2            [4, 5, 6, 7]
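
    The shift/cumsum labelling and the groupby can be wrapped into one helper that returns the segments in order of appearance. This is a sketch under assumptions, not the answerer's code: the function name consecutive_segments and its return shape are invented here for illustration.

```python
import pandas as pd


def consecutive_segments(df, column):
    """Return (value, index labels) for each run of equal consecutive values."""
    # A new block starts wherever the value differs from the previous row.
    block = (df[column] != df[column].shift()).cumsum()
    # Grouping by the block label keeps runs separate even when values repeat.
    return [
        (group[column].iloc[0], group.index.tolist())
        for _, group in df.groupby(block, sort=True)
    ]


# Same data as the 16-row example above.
df = pd.DataFrame({"A": [3] * 4 + [4] * 4 + [1] * 4 + [3] * 4})
for value, idx in consecutive_segments(df, "A"):
    print(value, idx)
```

    Grouping by the block label alone is enough here, because every block holds a single value. Note that NaN never compares equal to itself, so with this comparison each NaN row would form its own block.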
    