Count of a value in consecutive timestamp in pandas

Question

Hour              Site
01/08/2020 00:00    A
01/08/2020 00:00    B
01/08/2020 00:00    C
01/08/2020 00:00    D
01/08/2020 01:00    A
01/08/2020 01:00    B
01/08/2020 01:00    E
01/08/2020 01:00    F
01/08/2020 02:00    A
01/08/2020 02:00    E
01/08/2020 03:00    C
01/08/2020 03:00    G
 …..    
01/08/2020 04:00    x
01/08/2020 04:00    s

 …..    

01/08/2020 23:00    G
02/08/2020 00:00    G

I have a dataframe like the one above, where each hour contains multiple sites. I want to count how many consecutive hours each site appears in, together with the start and end timestamp of each run. For example, site A appears in 3 consecutive timestamps, then again in a single timestamp. I would like an output like the one below, or in a more effective format.

Hour              Site count    period_start    Period_end
01/08/2020 00:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 00:00    B   2   01/08/2020 00:00    01/08/2020 01:00
01/08/2020 00:00    C   1   ….. …
01/08/2020 00:00    D   1   ….  ….
01/08/2020 01:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 01:00    B   2   ….  ….
01/08/2020 01:00    E   2   ….  ….
01/08/2020 01:00    F   1   ….  ….
01/08/2020 02:00    A   3   01/08/2020 00:00    01/08/2020 03:00
01/08/2020 02:00    E   2   ….  ….
01/08/2020 03:00    C   1   ….  ….
01/08/2020 03:00    G   1   ….  ….
 …..            ….  ….
01/08/2020 04:00    x   1   01/08/2020 04:00    01/08/2020 04:00
01/08/2020 04:00    s   1   ….  ….
            ….  ….
 …..            ….  ….
            ….  ….
01/08/2020 23:00    G   2   ….  ….
02/08/2020 00:00    G   2   ….  ….

Thank you!

Answer 1:

Start by defining two functions:

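import pandas as pd

def cnt(grp):
    # Annotate one run of consecutive hours with its size and its bounds.
    hr = grp.Hour
    return grp.assign(count=hr.size, period_start=hr.iloc[0], period_end=hr.iloc[-1])

def fn(grp):
    # A gap of more than one hour starts a new run; cumsum numbers the runs.
    gr = grp.groupby((grp.Hour - grp.Hour.shift()).gt(pd.Timedelta(hours=1)).cumsum())
    return gr.apply(cnt)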

Then group by Site and apply fn to each group:

df.groupby('Site').apply(fn).reset_index(level=[0, 1], drop=True).sort_index()

It is easiest to read the code starting from the last instruction.

The first step is to group by Site (the first level of grouping) and apply fn to each group. Skip the rest of this instruction for now.

The fn function then performs the second-level grouping. The idea is to divide the source (first-level) group into groups of rows with consecutive hours.

The cnt function is applied to each second-level group. Its result is the source group with the count, period_start and period_end columns added.
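To make the second-level grouping key concrete, here is a minimal sketch of what the expression inside fn computes for a single site. The hours are hypothetical sample values (three runs: 00-02, 04-05 and 08), and the names hours, new_run and run_id are only for illustration.

```python
import pandas as pd

# Hypothetical hours for one site, already parsed and sorted.
hours = pd.Series(pd.to_datetime([
    "2020-08-01 00:00", "2020-08-01 01:00", "2020-08-01 02:00",
    "2020-08-01 04:00", "2020-08-01 05:00", "2020-08-01 08:00",
]))

# A gap larger than one hour marks the first row of a new run.
# The first difference is NaT, and NaT > 1 hour evaluates to False.
new_run = hours.diff().gt(pd.Timedelta(hours=1))

# The running total of the flags gives every run its own id.
run_id = new_run.cumsum()
print(run_id.tolist())  # [0, 0, 0, 1, 1, 2]
```

Rows that share a run_id belong to the same block of consecutive hours, so grouping by it yields one group per run.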

Now it is time to look at the skipped part of the first instruction. The groupby(...).apply(...) part generates the following result (for brevity, only the rows for Site == A and B are shown):

                            Hour Site  count        period_start           period_end
Site Hour                                                                            
A    0    0  2020-08-01 00:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
          4  2020-08-01 01:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
          8  2020-08-01 02:00:00    A      3 2020-08-01 00:00:00  2020-08-01 02:00:00
     1    12 2020-08-01 04:00:00    A      2 2020-08-01 04:00:00  2020-08-01 05:00:00
          14 2020-08-01 05:00:00    A      2 2020-08-01 04:00:00  2020-08-01 05:00:00
     2    15 2020-08-01 08:00:00    A      1 2020-08-01 08:00:00  2020-08-01 08:00:00
B    0    1  2020-08-01 00:00:00    B      2 2020-08-01 00:00:00  2020-08-01 01:00:00
          5  2020-08-01 01:00:00    B      2 2020-08-01 00:00:00  2020-08-01 01:00:00

To get the final result, two more steps are needed:

reset_index(...) - drop the first 2 levels of the index.
sort_index() - sort rows by index.
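The effect of these two steps can be seen in isolation with a small sketch. The frame below is hypothetical: its three-level index only mimics the (Site, run id, original row label) layout shown above, and the level name "run" is made up for the example.

```python
import pandas as pd

# Hypothetical frame whose index mimics (Site, run id, original row label).
idx = pd.MultiIndex.from_tuples(
    [("A", 0, 0), ("A", 0, 4), ("B", 0, 1), ("B", 0, 5)],
    names=["Site", "run", None],
)
nested = pd.DataFrame({"Site": ["A", "A", "B", "B"]}, index=idx)

# Drop the two outer levels, keeping only the original row labels.
flat = nested.reset_index(level=[0, 1], drop=True)

# Sort by those labels to restore the order of the input rows.
restored = flat.sort_index()
print(restored.index.tolist())  # [0, 1, 4, 5]
```

After both steps the rows are back in the order of the source DataFrame, with the new columns attached.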

The result is just as you expected.

Answer 2:

Initial DataFrame

print(df)
                  Hour Site
0  2020-01-08 00:00:00    A
1  2020-01-08 00:00:00    B
2  2020-01-08 00:00:00    C
3  2020-01-08 00:00:00    D
4  2020-01-08 01:00:00    A
5  2020-01-08 01:00:00    B
6  2020-01-08 01:00:00    E
7  2020-01-08 01:00:00    F
8  2020-01-08 02:00:00    A
9  2020-01-08 02:00:00    E
10 2020-01-08 03:00:00    C
11 2020-01-08 03:00:00    G
12 2020-01-08 04:00:00    X
13 2020-01-08 04:00:00    s
14 2020-01-08 23:00:00    G
15 2020-02-08 00:00:00    G

My approach

# if necessary, parse and sort first
#df['Hour']=pd.to_datetime(df['Hour'])
#df=df.sort_values('Hour')

g=( df.groupby('Site')['Hour'].diff().ne(pd.Timedelta(hours=1))
      .groupby(df['Site']).cumsum() )

groups = df.groupby(['Site',g])['Hour']
new_df = df.assign(count = groups.transform('size'),
                   Period_start = groups.transform('first'),
                   Period_end = groups.transform('last'))

print(new_df)

Output

                  Hour Site  count        Period_start          Period_end
0  2020-01-08 00:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
1  2020-01-08 00:00:00    B      2 2020-01-08 00:00:00 2020-01-08 01:00:00
2  2020-01-08 00:00:00    C      1 2020-01-08 00:00:00 2020-01-08 00:00:00
3  2020-01-08 00:00:00    D      1 2020-01-08 00:00:00 2020-01-08 00:00:00
4  2020-01-08 01:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
5  2020-01-08 01:00:00    B      2 2020-01-08 00:00:00 2020-01-08 01:00:00
6  2020-01-08 01:00:00    E      2 2020-01-08 01:00:00 2020-01-08 02:00:00
7  2020-01-08 01:00:00    F      1 2020-01-08 01:00:00 2020-01-08 01:00:00
8  2020-01-08 02:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
9  2020-01-08 02:00:00    E      2 2020-01-08 01:00:00 2020-01-08 02:00:00
10 2020-01-08 03:00:00    C      1 2020-01-08 03:00:00 2020-01-08 03:00:00
11 2020-01-08 03:00:00    G      1 2020-01-08 03:00:00 2020-01-08 03:00:00
12 2020-01-08 04:00:00    X      1 2020-01-08 04:00:00 2020-01-08 04:00:00
13 2020-01-08 04:00:00    s      1 2020-01-08 04:00:00 2020-01-08 04:00:00
14 2020-01-08 23:00:00    G      1 2020-01-08 23:00:00 2020-01-08 23:00:00
15 2020-02-08 00:00:00    G      1 2020-02-08 00:00:00 2020-02-08 00:00:00
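Note that in this output the dates were read month-first (2020-01-08, 2020-02-08), so the last two G rows end up a month apart and each gets count 1, whereas the question expects count 2 for them. If the strings are day-first, as the question's sample suggests, parsing with dayfirst=True may be what is needed. A minimal sketch on just those two rows, assuming that interpretation:

```python
import pandas as pd

# The last two rows of the question's sample, as raw strings.
df = pd.DataFrame({
    "Hour": ["01/08/2020 23:00", "02/08/2020 00:00"],
    "Site": ["G", "G"],
})

# dayfirst=True reads 01/08/2020 as 1 August 2020, not 8 January 2020.
df["Hour"] = pd.to_datetime(df["Hour"], dayfirst=True)

# Same run key as in the approach above: a new run starts wherever
# the gap to the previous row of the same site is not exactly 1 hour.
g = (df.groupby("Site")["Hour"].diff().ne(pd.Timedelta(hours=1))
       .groupby(df["Site"]).cumsum())

count = df.groupby(["Site", g])["Hour"].transform("size")
print(count.tolist())  # [2, 2]
```

With day-first parsing the two rows are one hour apart and fall into the same run.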

If you want to mask the period columns when count equals 1

# if necessary, parse and sort first
#df['Hour']=pd.to_datetime(df['Hour'])
#df=df.sort_values('Hour')

g=( df.groupby('Site')['Hour'].diff().ne(pd.Timedelta(hours=1))
      .groupby(df['Site']).cumsum() )

groups = df.groupby(['Site',g])['Hour']
new_df =( df.assign(count = groups.transform('size'))
            .assign(Period_start = lambda x: groups.transform('first')
                                                   .where(x['count'].gt(1)),
                   Period_end = lambda x: groups.transform('last')
                                                .where(x['count'].gt(1))) )
print(new_df)

Output

                  Hour Site  count        Period_start          Period_end
0  2020-01-08 00:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
1  2020-01-08 00:00:00    B      2 2020-01-08 00:00:00 2020-01-08 01:00:00
2  2020-01-08 00:00:00    C      1                 NaT                 NaT
3  2020-01-08 00:00:00    D      1                 NaT                 NaT
4  2020-01-08 01:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
5  2020-01-08 01:00:00    B      2 2020-01-08 00:00:00 2020-01-08 01:00:00
6  2020-01-08 01:00:00    E      2 2020-01-08 01:00:00 2020-01-08 02:00:00
7  2020-01-08 01:00:00    F      1                 NaT                 NaT
8  2020-01-08 02:00:00    A      3 2020-01-08 00:00:00 2020-01-08 02:00:00
9  2020-01-08 02:00:00    E      2 2020-01-08 01:00:00 2020-01-08 02:00:00
10 2020-01-08 03:00:00    C      1                 NaT                 NaT
11 2020-01-08 03:00:00    G      1                 NaT                 NaT
12 2020-01-08 04:00:00    X      1                 NaT                 NaT
13 2020-01-08 04:00:00    s      1                 NaT                 NaT
14 2020-01-08 23:00:00    G      1                 NaT                 NaT
15 2020-02-08 00:00:00    G      1                 NaT                 NaT
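Since the question also asks for a "more effective format", one possible alternative is a compact table with one row per run instead of repeating the period on every source row. This is only a sketch using the same run key; the sample rows are a hypothetical subset of the question's data, and the names run and summary are illustrative.

```python
import pandas as pd

# A hypothetical subset of the sample, already parsed and sorted by Hour.
df = pd.DataFrame({
    "Hour": pd.to_datetime([
        "2020-08-01 00:00", "2020-08-01 00:00", "2020-08-01 01:00",
        "2020-08-01 01:00", "2020-08-01 02:00", "2020-08-01 04:00",
    ]),
    "Site": ["A", "B", "A", "B", "A", "A"],
})

# Run key per site, renamed so it does not clash with the Hour column.
run = (df.groupby("Site")["Hour"].diff().ne(pd.Timedelta(hours=1))
         .groupby(df["Site"]).cumsum().rename("run"))

# One row per (Site, run), built with named aggregation.
summary = (df.groupby(["Site", run])["Hour"]
             .agg(count="size", period_start="first", period_end="last")
             .reset_index())
print(summary)
```

Here site A yields two rows (a run of 3 hours and a run of 1 hour) and site B yields one row (a run of 2 hours).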

Source: https://stackoverflow.com/questions/59810506/count-of-a-value-in-consecutive-timestamp-in-pandas

Tags

python

pandas

data-analysis