Question
Hour Site
01/08/2020 00:00 A
01/08/2020 00:00 B
01/08/2020 00:00 C
01/08/2020 00:00 D
01/08/2020 01:00 A
01/08/2020 01:00 B
01/08/2020 01:00 E
01/08/2020 01:00 F
01/08/2020 02:00 A
01/08/2020 02:00 E
01/08/2020 03:00 C
01/08/2020 03:00 G
…..
01/08/2020 04:00 x
01/08/2020 04:00 s
…..
01/08/2020 23:00 G
02/08/2020 00:00 G
I have a dataframe like the one above. Each hour contains multiple sites. For each site, I want to count how many consecutive hours it appears in, together with the start and end timestamps of that run. For example, site A appears in 3 consecutive timestamps, then again in a single timestamp. I want output like the one below, or in a more effective format.
Hour Site count period_start Period_end
01/08/2020 00:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 00:00 B 2 01/08/2020 00:00 01/08/2020 01:00
01/08/2020 00:00 C 1 ….. …
01/08/2020 00:00 D 1 …. ….
01/08/2020 01:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 01:00 B 2 …. ….
01/08/2020 01:00 E 2 …. ….
01/08/2020 01:00 F 1 …. ….
01/08/2020 02:00 A 3 01/08/2020 00:00 01/08/2020 03:00
01/08/2020 02:00 E 2 …. ….
01/08/2020 03:00 C 1 …. ….
01/08/2020 03:00 G 1 …. ….
….. …. ….
01/08/2020 04:00 x 1 01/08/2020 04:00 01/08/2020 04:00
01/08/2020 04:00 s 1 …. ….
…. ….
….. …. ….
…. ….
01/08/2020 23:00 G 2 …. ….
02/08/2020 00:00 G 2 …. ….
Thank you!
Answer 1:
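Start by defining two functions:
import pandas as pd

def cnt(grp):
    # Annotate one run of consecutive hours with its size and its first and last timestamp.
    hr = grp.Hour
    return grp.assign(count=hr.size, period_start=hr.iloc[0], period_end=hr.iloc[-1])

def fn(grp):
    # Start a new run wherever the gap to the previous row exceeds one hour.
    gr = grp.groupby((grp.Hour - grp.Hour.shift()).gt(pd.Timedelta(hours=1)).cumsum())
    return gr.apply(cnt)
Then group by Site and apply fn:
df.groupby('Site').apply(fn).reset_index(level=[0, 1], drop=True).sort_index()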
Read the code starting from the end.
The first step is to group by Site (the first level of grouping) and apply fn to each group. Skip the rest of that statement for now.
The fn function then performs the second level of grouping. The idea is to divide the source (first-level) group into groups of rows with consecutive hours.
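As a rough illustration of that second-level grouping, the sketch below applies the same gap test to the hours of a single site. The timestamps are hypothetical sample values chosen to contain two gaps, and the names hours, new_run and run_id are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical hours for one site: 00:00 to 02:00, then 04:00 to 05:00, then 08:00.
hours = pd.Series(pd.to_datetime([
    "2020-08-01 00:00", "2020-08-01 01:00", "2020-08-01 02:00",
    "2020-08-01 04:00", "2020-08-01 05:00", "2020-08-01 08:00",
]))

# True wherever the gap to the previous row exceeds one hour.
# The first row has no predecessor (NaT), and NaT comparisons evaluate to False.
new_run = hours.diff().gt(pd.Timedelta(hours=1))

# The running sum of those flags gives every run of consecutive hours its own label.
run_id = new_run.cumsum()
print(run_id.tolist())  # [0, 0, 0, 1, 1, 2]
```

Grouping by run_id then yields one group per run, which is what the inner groupby relies on.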
The cnt function is applied to each second-level group. Its result is the source group with count, period_start and period_end columns added.
Now it is time to look at the skipped part of the first step. The groupby(...).apply(...) part generates the following result (for brevity, only the rows for Site == A and B are shown):
                            Hour Site  count        period_start          period_end
Site Hour
A    0    0  2020-08-01 00:00:00    A      3 2020-08-01 00:00:00 2020-08-01 02:00:00
          4  2020-08-01 01:00:00    A      3 2020-08-01 00:00:00 2020-08-01 02:00:00
          8  2020-08-01 02:00:00    A      3 2020-08-01 00:00:00 2020-08-01 02:00:00
     1    12 2020-08-01 04:00:00    A      2 2020-08-01 04:00:00 2020-08-01 05:00:00
          14 2020-08-01 05:00:00    A      2 2020-08-01 04:00:00 2020-08-01 05:00:00
     2    15 2020-08-01 08:00:00    A      1 2020-08-01 08:00:00 2020-08-01 08:00:00
B    0    1  2020-08-01 00:00:00    B      2 2020-08-01 00:00:00 2020-08-01 01:00:00
          5  2020-08-01 01:00:00    B      2 2020-08-01 00:00:00 2020-08-01 01:00:00
To get the final result, two more steps are needed:
- reset_index(...) - drop the first 2 levels of the index.
- sort_index() - sort rows by index.
The result is just as you expected.
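The two clean-up steps can be tried in isolation on a small frame. This is only a sketch: the three-level index below is hand-built to resemble the intermediate result (Site, run label, original row number), and the names idx, nested and flat are hypothetical.

```python
import pandas as pd

# Hand-built index shaped like the intermediate result: (Site, run label, original row number).
idx = pd.MultiIndex.from_tuples(
    [("A", 0, 0), ("A", 0, 4), ("B", 0, 1)],
    names=["Site", "run", None],
)
nested = pd.DataFrame({"count": [2, 2, 1]}, index=idx)

# Drop the two outer levels, then restore the original row order.
flat = nested.reset_index(level=[0, 1], drop=True).sort_index()
print(flat.index.tolist())     # [0, 1, 4]
print(flat["count"].tolist())  # [2, 1, 2]
```

Only the innermost level survives, so the rows line up again with the index of the original dataframe.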
Answer 2:
Initial DataFrame
print(df)
Hour Site
0 2020-01-08 00:00:00 A
1 2020-01-08 00:00:00 B
2 2020-01-08 00:00:00 C
3 2020-01-08 00:00:00 D
4 2020-01-08 01:00:00 A
5 2020-01-08 01:00:00 B
6 2020-01-08 01:00:00 E
7 2020-01-08 01:00:00 F
8 2020-01-08 02:00:00 A
9 2020-01-08 02:00:00 E
10 2020-01-08 03:00:00 C
11 2020-01-08 03:00:00 G
12 2020-01-08 04:00:00 X
13 2020-01-08 04:00:00 s
14 2020-01-08 23:00:00 G
15 2020-02-08 00:00:00 G
My approach
import pandas as pd

# if necessary
# df['Hour'] = pd.to_datetime(df['Hour'])
# df = df.sort_values('Hour')

# Per site, flag rows whose gap to the previous row is not exactly one hour,
# then cumulate the flags into a run label.
g = (df.groupby('Site')['Hour'].diff().ne(pd.Timedelta(hours=1))
       .groupby(df['Site']).cumsum())
groups = df.groupby(['Site', g])['Hour']
new_df = df.assign(count=groups.transform('size'),
                   Period_start=groups.transform('first'),
                   Period_end=groups.transform('last'))
print(new_df)
Output
Hour Site count Period_start Period_end
0 2020-01-08 00:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
1 2020-01-08 00:00:00 B 2 2020-01-08 00:00:00 2020-01-08 01:00:00
2 2020-01-08 00:00:00 C 1 2020-01-08 00:00:00 2020-01-08 00:00:00
3 2020-01-08 00:00:00 D 1 2020-01-08 00:00:00 2020-01-08 00:00:00
4 2020-01-08 01:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
5 2020-01-08 01:00:00 B 2 2020-01-08 00:00:00 2020-01-08 01:00:00
6 2020-01-08 01:00:00 E 2 2020-01-08 01:00:00 2020-01-08 02:00:00
7 2020-01-08 01:00:00 F 1 2020-01-08 01:00:00 2020-01-08 01:00:00
8 2020-01-08 02:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
9 2020-01-08 02:00:00 E 2 2020-01-08 01:00:00 2020-01-08 02:00:00
10 2020-01-08 03:00:00 C 1 2020-01-08 03:00:00 2020-01-08 03:00:00
11 2020-01-08 03:00:00 G 1 2020-01-08 03:00:00 2020-01-08 03:00:00
12 2020-01-08 04:00:00 X 1 2020-01-08 04:00:00 2020-01-08 04:00:00
13 2020-01-08 04:00:00 s 1 2020-01-08 04:00:00 2020-01-08 04:00:00
14 2020-01-08 23:00:00 G 1 2020-01-08 23:00:00 2020-01-08 23:00:00
15 2020-02-08 00:00:00 G 1 2020-02-08 00:00:00 2020-02-08 00:00:00
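Note that the frame above reads the dates month first (8 January and 8 February), so the two G rows are a month apart and each gets count 1, whereas the question expects 2. If the data is day first, as the question's sample suggests (an assumption), parsing with an explicit format avoids the ambiguity. A minimal sketch, with raw and hours as illustrative names:

```python
import pandas as pd

# Two timestamps from the question, assumed to be written day first (dd/mm/yyyy).
raw = pd.Series(["01/08/2020 23:00", "02/08/2020 00:00"])

# An explicit format removes the ambiguity between 8 January and 1 August.
hours = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

# The two rows are now one hour apart, so they would fall into the same run.
print(hours.diff().iloc[1])  # 0 days 01:00:00
```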
If you want to mask the period columns when count equals 1:
# if necessary
# df['Hour'] = pd.to_datetime(df['Hour'])
# df = df.sort_values('Hour')
g = (df.groupby('Site')['Hour'].diff().ne(pd.Timedelta(hours=1))
       .groupby(df['Site']).cumsum())
groups = df.groupby(['Site', g])['Hour']
# Keep the period bounds only for runs longer than one hour; the rest become NaT.
new_df = (df.assign(count=groups.transform('size'))
            .assign(Period_start=lambda x: groups.transform('first')
                                                 .where(x['count'].gt(1)),
                    Period_end=lambda x: groups.transform('last')
                                               .where(x['count'].gt(1))))
print(new_df)
Output
Hour Site count Period_start Period_end
0 2020-01-08 00:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
1 2020-01-08 00:00:00 B 2 2020-01-08 00:00:00 2020-01-08 01:00:00
2 2020-01-08 00:00:00 C 1 NaT NaT
3 2020-01-08 00:00:00 D 1 NaT NaT
4 2020-01-08 01:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
5 2020-01-08 01:00:00 B 2 2020-01-08 00:00:00 2020-01-08 01:00:00
6 2020-01-08 01:00:00 E 2 2020-01-08 01:00:00 2020-01-08 02:00:00
7 2020-01-08 01:00:00 F 1 NaT NaT
8 2020-01-08 02:00:00 A 3 2020-01-08 00:00:00 2020-01-08 02:00:00
9 2020-01-08 02:00:00 E 2 2020-01-08 01:00:00 2020-01-08 02:00:00
10 2020-01-08 03:00:00 C 1 NaT NaT
11 2020-01-08 03:00:00 G 1 NaT NaT
12 2020-01-08 04:00:00 X 1 NaT NaT
13 2020-01-08 04:00:00 s 1 NaT NaT
14 2020-01-08 23:00:00 G 1 NaT NaT
15 2020-02-08 00:00:00 G 1 NaT NaT
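The question also asks about a more effective format. One possibility, sketched here under the assumption that one row per run is acceptable, is to aggregate on the same run labels instead of broadcasting them back with transform. The small df and the names run and runs are made up for illustration.

```python
import pandas as pd

# Small made-up sample: A appears at 00:00 to 02:00 and again at 04:00, B at 00:00 to 01:00.
df = pd.DataFrame({
    "Hour": pd.to_datetime([
        "2020-08-01 00:00", "2020-08-01 00:00", "2020-08-01 01:00",
        "2020-08-01 01:00", "2020-08-01 02:00", "2020-08-01 04:00",
    ]),
    "Site": ["A", "B", "A", "B", "A", "A"],
})

# Same run labels as in the approach above: a new run starts when the
# per-site gap to the previous row is not exactly one hour.
g = (df.groupby("Site")["Hour"].diff().ne(pd.Timedelta(hours=1))
       .groupby(df["Site"]).cumsum())

# Collapse each (Site, run) pair into a single summary row.
runs = (df.groupby(["Site", g.rename("run")])["Hour"]
          .agg(count="size", period_start="first", period_end="last")
          .reset_index())
print(runs)
```

Each site then has one row per run of consecutive hours, which is more compact than repeating the same period on every source row.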
Source: https://stackoverflow.com/questions/59810506/count-of-a-value-in-consecutive-timestamp-in-pandas