Pandas - group by consecutive ranges

太阳男子 2021-01-04 20:54

I have a dataframe with the following structure - Start, End and Height.

Some properties of the dataframe:

  • A row in the dataframe always starts from wh
2 Answers
  • 2021-01-04 21:15

    One way to do this:

    import pandas as pd

    df = pd.DataFrame([[1,3,10], [4,10,7], [11,17,6], [18,26, 12],
    [27,30, 15], [31,40,6], [41, 42, 6]], columns=['start','end', 'height'])
    

    Use cut to bin the heights into groups:

    df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])
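As a small illustration of what `pd.cut` does with these bin edges, the sketch below bins the same heights on a standalone Series. The variable names `heights` and `edge_cases` are made up for this example; the bin edges are the ones used above.

```python
import pandas as pd

bins = [-1, 0, 5, 10, 15, 1000]
heights = pd.Series([10, 7, 6, 12, 15, 6, 6], name="height")

# Each height is replaced by the interval it falls into.
groups = pd.cut(heights, bins=bins)
print(groups.tolist())

# Intervals are closed on the right by default, so a value sitting exactly
# on an edge belongs to the lower bin: 5 goes to (0, 5], 10 to (5, 10].
edge_cases = pd.cut(pd.Series([5, 10]), bins=bins)
print(edge_cases.tolist())
```

If the lower edge should be inclusive instead, `pd.cut` accepts `right=False`.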
    

    Find the break points, i.e. the rows whose group differs from the previous row's group:

    df['categories']=(df.groups!=df.groups.shift()).cumsum()
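The shift/cumsum idiom may be easier to follow on plain labels. This is a minimal sketch with made-up string labels standing in for the bins; `changed` and `run_id` are names chosen for the example.

```python
import pandas as pd

labels = pd.Series(["a", "a", "a", "b", "b", "a", "a"])

# True wherever a label differs from the one before it. The first row is
# compared against the NaN produced by shift(), so it counts as a change.
changed = labels != labels.shift()

# A running total of the changes gives every consecutive run its own id.
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 3]
```

The numbering itself is arbitrary (the output shown in this answer starts at 0). What matters is that the id changes at every break, so the second run of "a" is not merged with the first.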
    

    df then looks like this:

    """
       start  end  height    groups  categories
    0      1    3      10   (5, 10]           0
    1      4   10       7   (5, 10]           0
    2     11   17       6   (5, 10]           0
    3     18   26      12  (10, 15]           1
    4     27   30      15  (10, 15]           1
    5     31   40       6   (5, 10]           2
    6     41   42       6   (5, 10]           2
    """
    

    Define which values to keep for each run:

    f = {'start':['first'],'end':['last'], 'groups':['first']}
    

    Then aggregate with groupby.agg:

    df.groupby('categories').agg(f)
    """
                  groups  end start
                   first last first
    categories                     
    0            (5, 10]   17     1
    1           (10, 15]   30    18
    2            (5, 10]   42    31
    """
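The dict of lists above produces two-level column headers. If flat column names are preferred, one option (not part of the original answer) is named aggregation, available in pandas 0.25 and later. The output name `height_range` is chosen for this sketch only.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 3, 10], [4, 10, 7], [11, 17, 6], [18, 26, 12],
     [27, 30, 15], [31, 40, 6], [41, 42, 6]],
    columns=["start", "end", "height"],
)
df["groups"] = pd.cut(df["height"], [-1, 0, 5, 10, 15, 1000])
df["categories"] = (df["groups"] != df["groups"].shift()).cumsum()

# Named aggregation: output_column=(input_column, function).
result = df.groupby("categories").agg(
    start=("start", "first"),
    end=("end", "last"),
    height_range=("groups", "first"),
)
print(result)
```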
    
  • 2021-01-04 21:37

    You can bin the heights with cut, then group by both the binned Series and a run counter built with shift and cumsum, and aggregate with agg using first and last:

    bins = [-1,0,1,5,10,15,100]
    print(bins)
    [-1, 0, 1, 5, 10, 15, 100]
    
    # d is the input DataFrame with columns start, end and height
    cut_ser = pd.cut(d['height'], bins=bins)
    print(cut_ser)
    0     (5, 10]
    1     (5, 10]
    2     (5, 10]
    3    (10, 15]
    4    (10, 15]
    5     (5, 10]
    6     (5, 10]
    Name: height, dtype: category
    Categories (6, object): [(-1, 0] < (0, 1] < (1, 5] < (5, 10] < (10, 15] < (15, 100]]
    
    print((cut_ser.shift() != cut_ser).cumsum())
    0    0
    1    0
    2    0
    3    1
    4    1
    5    2
    6    2
    Name: height, dtype: int32
    
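    # observed=True keeps only the bins that actually occur (needed on newer pandas versions)
    print(d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser], observed=True)
           .agg({'start' : 'first','end' : 'last'})
           .reset_index(level=1).reset_index(drop=True)
           .rename(columns={'height':'height_grouped'}))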
    
      height_grouped  start  end
    0        (5, 10]      1   17
    1       (10, 15]     18   30
    2        (5, 10]     31   42
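The steps above can be wrapped in a reusable function. The following is only a sketch: the helper name `merge_consecutive_bins` is hypothetical, and it compares integer category codes and stores the bin as a string label, which is a variation on the answer's approach rather than the same code.

```python
import pandas as pd


def merge_consecutive_bins(frame, bins):
    """Collapse consecutive rows whose height falls into the same bin."""
    binned = pd.cut(frame["height"], bins=bins)
    # Compare the integer category codes of neighbouring rows; a new run
    # starts wherever the code differs from the previous row's code.
    codes = binned.cat.codes
    run_id = (codes != codes.shift()).cumsum()
    return (
        frame.assign(height_grouped=binned.astype(str), run=run_id)
        .groupby("run")
        .agg(
            height_grouped=("height_grouped", "first"),
            start=("start", "first"),
            end=("end", "last"),
        )
        .reset_index(drop=True)
    )


d = pd.DataFrame(
    [[1, 3, 10], [4, 10, 7], [11, 17, 6], [18, 26, 12],
     [27, 30, 15], [31, 40, 6], [41, 42, 6]],
    columns=["start", "end", "height"],
)
merged = merge_consecutive_bins(d, bins=[-1, 0, 1, 5, 10, 15, 100])
print(merged)
```

Grouping by the run id alone avoids putting a categorical among the group keys, so no unobserved bin combinations can show up in the result.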
    

    EDIT:

    Timings:

    In [307]: %timeit a(df)
    100 loops, best of 3: 5.45 ms per loop
    
    In [308]: %timeit b(d)
    The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached 
    100 loops, best of 3: 3.28 ms per loop
    

    Code:

    import pandas as pd

    d = pd.DataFrame([[1,3,5], [4,10,7], [11,17,6], [18,26, 12], [27,30, 15], [31,40,6], [41, 42, 7]], columns=['start','end', 'height'])
    print(d)
    
    df = d.copy()
    
    
    def a(df):
        # first answer: store bins and run ids as columns, then aggregate per run
        df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])
        df['categories']=(df.groups!=df.groups.shift()).cumsum()
        f = {'start':['first'],'end':['last'], 'groups':['first']}
        return df.groupby('categories').agg(f)
    
    def b(d):
        # second answer: group directly by the run id and the binned Series
        bins = [-1,0,1,5,10,15,100]
        cut_ser = pd.cut(d['height'], bins=bins)
        # observed=True keeps only the bins that actually occur (needed on newer pandas versions)
        return (d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser], observed=True)
                 .agg({'start' : 'first','end' : 'last'})
                 .reset_index(level=1).reset_index(drop=True)
                 .rename(columns={'height':'height_grouped'}))
    
    
    print(a(df))
    print(b(d))
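`%timeit` is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The function `group_runs` below is a hypothetical stand-in for `a` or `b`, and the numbers it prints depend entirely on the machine and pandas version.

```python
import timeit

import pandas as pd

d = pd.DataFrame(
    [[1, 3, 5], [4, 10, 7], [11, 17, 6], [18, 26, 12],
     [27, 30, 15], [31, 40, 6], [41, 42, 7]],
    columns=["start", "end", "height"],
)


def group_runs(frame):
    # Stand-in for a() or b(): bin the heights, find runs, aggregate.
    binned = pd.cut(frame["height"], bins=[-1, 0, 1, 5, 10, 15, 100])
    codes = binned.cat.codes
    run_id = (codes != codes.shift()).cumsum()
    return frame.groupby(run_id).agg({"start": "first", "end": "last"})


repeats = 20
seconds = timeit.timeit(lambda: group_runs(d), number=repeats)
print(f"{seconds / repeats * 1e3:.2f} ms per call")
```

With only seven rows the measurement is dominated by fixed overhead, so a larger frame would be needed to compare how the two approaches scale.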
    