问题
I have 2 sets of Dataframe, both with an unique identifier and a datetime data in the format as such
"2020-01-01 00:00:01"-datetime and "12345" - unique identifier and Type
1st Question, DF1:
DatetimeX ID Type
2020-01-01 02:00:01 12345 C
2020-01-01 02:00:03 12345 C
2020-01-01 05:00:03 12345 C
2020-01-01 05:03:05 12345 C
2020-01-01 03:00:09 13333 D
2020-01-01 02:00:09 12345 C
2020-01-01 02:01:35 12345 C
2020-01-01 02:10:35 12345 C
2020-01-01 02:00:01 13333 D
2020-01-01 02:05:35 13333 D
2020-01-01 02:00:50 13333 E
2020-01-01 02:00:01 12211 C
2020-01-01 02:09:50 13333 E
2020-01-01 02:11:50 13333 E
I would like to based on the ID's 1st time stamp with the same "Type", and remove the rows 10mins after as such:
DatetimeX ID Type
2020-01-01 02:00:01 12345 C
2020-01-01 05:00:03 12345 C
2020-01-01 02:10:35 12345 C
2020-01-01 03:00:09 13333 D
2020-01-01 02:00:01 13333 D
2020-01-01 02:00:50 13333 E
2020-01-01 02:00:01 12211 C
2020-01-01 02:11:50 13333 E
I've tried to explore timerange/daterange but could not find any similar concept of coding. Would hope that if anyone can point out what kind of ways i can look into to explore and not trying to get a full solution. Have not touch python for a few years and not familiar with it previously. Thank you
Updated with additional data row for more accurate example
回答1:
IIUC you should try groupby
:
>>> df.groupby((df.Type != df.Type.shift()).cumsum(), as_index=False).first()
DatetimeX ID Type
0 2020-01-01 02:00:01 12345 C
1 2020-01-01 02:00:01 13333 D
2 2020-01-01 02:00:50 13333 E
3 2020-01-01 02:00:01 12211 C
>>>
It groups by consencutive same values.
回答2:
Based on your statement I would like to based on the ID's 1st time stamp with the same "Type", and remove the rows 10mins, I believe you can use groupby().transform()
to identify the first timestamps, then use boolean masking:
# also transform('min')
first_timestamps = df.groupby(['ID','Type'])['DatetimeX'].transform('first')
mask = df['DatetimeX'] - first_timestamps < pd.Timedelta('10Min')
df[mask]
However, since your sample data all have time within 10 mins from each other, this wouldn't cut anything from it.
Instead, if we change 10Min
to 1S
in the second line above, we have the expected output:
DatetimeX ID Type
0 2020-01-01 02:00:01 12345 C
4 2020-01-01 02:00:01 13333 D
6 2020-01-01 02:00:50 13333 E
7 2020-01-01 02:00:01 12211 C
回答3:
Add sample input data and simplfied the process:
Timestamp = pd.to_datetime
data = [{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 02:00:03'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 05:00:03'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 05:03:05'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 03:00:09'), 'ID': 13333, 'Type': 'D'},
{'DatetimeX': Timestamp('2020-01-01 02:00:09'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 02:01:35'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 02:10:35'), 'ID': 12345, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 13333, 'Type': 'D'},
{'DatetimeX': Timestamp('2020-01-01 02:05:35'), 'ID': 13333, 'Type': 'D'},
{'DatetimeX': Timestamp('2020-01-01 02:00:50'), 'ID': 13333, 'Type': 'E'},
{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12211, 'Type': 'C'},
{'DatetimeX': Timestamp('2020-01-01 02:09:50'), 'ID': 13333, 'Type': 'E'},
{'DatetimeX': Timestamp('2020-01-01 02:11:50'), 'ID': 13333, 'Type': 'E'}]
df1 = pd.DataFrame(data)
col_raw = df1.columns
while True:
df1.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
df1['diff1_lt10min'] = df1.groupby(['ID', 'Type'])['DatetimeX'].diff().dt.seconds < 10 * 60
df1['tag_group'] = (~df1['diff1_lt10min']).cumsum()
if df1.duplicated('tag_group').sum()==0:
break
df1 = df1.merge((df1.groupby('tag_group')['DatetimeX'].first()
.reset_index()
.rename(columns={'DatetimeX':'DatetimeX_1st'})),
on='tag_group')
df1['diff2_lt10min'] = (df1.DatetimeX - df1.DatetimeX_1st).dt.seconds < 10 * 60
cond = df1['diff1_lt10min'] & df1['diff2_lt10min']
df1 = df1.loc[~cond, col_raw]
df1 = df1[col_raw]
Detail...
# repeat
col_raw = df1.columns
df4 = df1.copy()
n_round = 1
while True:
print('#'*20, f'round {n_round}', '#'*20)
# step 1 sort the values & group by ['Type', 'ID'] calculate the DatetimeX's time diff
# notice: the time-diff is not the actual wanted
df = df4[col_raw].copy()
df.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
df['diff'] = df.groupby(['Type', 'ID'])['DatetimeX'].diff()
print('#'*10, 'step1', '#'*10)
print(df)
# step 2, create a tag column to store the first 10min gap from 'diff' column
cond = False
cond |= df['diff'].dt.seconds > 10 * 60
cond |= df['diff'].isnull()
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum().fillna(method = 'ffill')
print('#'*10, 'step2', '#'*10)
print(df)
# step 3, use 'tag' to judge to stop the while loop or not
# tag should be unique
break_sign = df.tag.duplicated().sum()
if break_sign == 0:
break
print('#'*10, 'step3', '#'*10)
print(break_sign)
# step 4:
# create a 'DatetimeX_1st' with the 'tag' group's first DatetimeX
# create a 'diff2' = 'DatetimeX' - 'DatetimeX_1st'
df2 = df.reset_index().set_index('tag')
df2['DatetimeX_1st'] = df.groupby('tag').first()['DatetimeX']
df2['diff2'] = df2['DatetimeX'] - df2['DatetimeX_1st']
print('#'*10, 'step4', '#'*10)
print(df2)
# step 5:
# drop the True < 10min gaps records
# 'diff' and 'diff2' should all < 10min
cond = (df2['diff2'].dt.seconds < 10 * 60) & (df2['diff'].dt.seconds < 10 * 60)
df3 = df2[~cond].copy()
print('#'*10, 'step5', '#'*10)
print(df3)
# step 6:
# reset index
cols = 'tag DatetimeX ID Type'.split()
df4 = df3.reset_index().set_index('index').sort_index()[cols]
print('#'*10, 'step6', '#'*10)
print(df4)
n_round += 1
print()
# get result
result = df[['DatetimeX', 'ID', 'Type']].copy()
result.index.name = None
print()
print('#'*10, 'result', '#'*10)
print(result)
output:
#################### round 1 ####################
########## step1 ##########
DatetimeX ID Type diff
11 2020-01-01 02:00:01 12211 C NaT
0 2020-01-01 02:00:01 12345 C NaT
1 2020-01-01 02:00:03 12345 C 0 days 00:00:02
5 2020-01-01 02:00:09 12345 C 0 days 00:00:06
6 2020-01-01 02:01:35 12345 C 0 days 00:01:26
7 2020-01-01 02:10:35 12345 C 0 days 00:09:00
2 2020-01-01 05:00:03 12345 C 0 days 02:49:28
3 2020-01-01 05:03:05 12345 C 0 days 00:03:02
8 2020-01-01 02:00:01 13333 D NaT
9 2020-01-01 02:05:35 13333 D 0 days 00:05:34
4 2020-01-01 03:00:09 13333 D 0 days 00:54:34
10 2020-01-01 02:00:50 13333 E NaT
12 2020-01-01 02:09:50 13333 E 0 days 00:09:00
13 2020-01-01 02:11:50 13333 E 0 days 00:02:00
########## step2 ##########
DatetimeX ID Type diff tag
11 2020-01-01 02:00:01 12211 C NaT 1
0 2020-01-01 02:00:01 12345 C NaT 2
1 2020-01-01 02:00:03 12345 C 0 days 00:00:02 2
5 2020-01-01 02:00:09 12345 C 0 days 00:00:06 2
6 2020-01-01 02:01:35 12345 C 0 days 00:01:26 2
7 2020-01-01 02:10:35 12345 C 0 days 00:09:00 2
2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 3
3 2020-01-01 05:03:05 12345 C 0 days 00:03:02 3
8 2020-01-01 02:00:01 13333 D NaT 4
9 2020-01-01 02:05:35 13333 D 0 days 00:05:34 4
4 2020-01-01 03:00:09 13333 D 0 days 00:54:34 5
10 2020-01-01 02:00:50 13333 E NaT 6
12 2020-01-01 02:09:50 13333 E 0 days 00:09:00 6
13 2020-01-01 02:11:50 13333 E 0 days 00:02:00 6
########## step3 ##########
8
########## step4 ##########
index DatetimeX ID Type diff \
tag
1 11 2020-01-01 02:00:01 12211 C NaT
2 0 2020-01-01 02:00:01 12345 C NaT
2 1 2020-01-01 02:00:03 12345 C 0 days 00:00:02
2 5 2020-01-01 02:00:09 12345 C 0 days 00:00:06
2 6 2020-01-01 02:01:35 12345 C 0 days 00:01:26
2 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00
3 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28
3 3 2020-01-01 05:03:05 12345 C 0 days 00:03:02
4 8 2020-01-01 02:00:01 13333 D NaT
4 9 2020-01-01 02:05:35 13333 D 0 days 00:05:34
5 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34
6 10 2020-01-01 02:00:50 13333 E NaT
6 12 2020-01-01 02:09:50 13333 E 0 days 00:09:00
6 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00
DatetimeX_1st diff2
tag
1 2020-01-01 02:00:01 0 days 00:00:00
2 2020-01-01 02:00:01 0 days 00:00:00
2 2020-01-01 02:00:01 0 days 00:00:02
2 2020-01-01 02:00:01 0 days 00:00:08
2 2020-01-01 02:00:01 0 days 00:01:34
2 2020-01-01 02:00:01 0 days 00:10:34
3 2020-01-01 05:00:03 0 days 00:00:00
3 2020-01-01 05:00:03 0 days 00:03:02
4 2020-01-01 02:00:01 0 days 00:00:00
4 2020-01-01 02:00:01 0 days 00:05:34
5 2020-01-01 03:00:09 0 days 00:00:00
6 2020-01-01 02:00:50 0 days 00:00:00
6 2020-01-01 02:00:50 0 days 00:09:00
6 2020-01-01 02:00:50 0 days 00:11:00
########## step5 ##########
index DatetimeX ID Type diff \
tag
1 11 2020-01-01 02:00:01 12211 C NaT
2 0 2020-01-01 02:00:01 12345 C NaT
2 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00
3 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28
4 8 2020-01-01 02:00:01 13333 D NaT
5 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34
6 10 2020-01-01 02:00:50 13333 E NaT
6 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00
DatetimeX_1st diff2
tag
1 2020-01-01 02:00:01 0 days 00:00:00
2 2020-01-01 02:00:01 0 days 00:00:00
2 2020-01-01 02:00:01 0 days 00:10:34
3 2020-01-01 05:00:03 0 days 00:00:00
4 2020-01-01 02:00:01 0 days 00:00:00
5 2020-01-01 03:00:09 0 days 00:00:00
6 2020-01-01 02:00:50 0 days 00:00:00
6 2020-01-01 02:00:50 0 days 00:11:00
########## step6 ##########
tag DatetimeX ID Type
index
0 2 2020-01-01 02:00:01 12345 C
2 3 2020-01-01 05:00:03 12345 C
4 5 2020-01-01 03:00:09 13333 D
7 2 2020-01-01 02:10:35 12345 C
8 4 2020-01-01 02:00:01 13333 D
10 6 2020-01-01 02:00:50 13333 E
11 1 2020-01-01 02:00:01 12211 C
13 6 2020-01-01 02:11:50 13333 E
#################### round 2 ####################
########## step1 ##########
DatetimeX ID Type diff
index
11 2020-01-01 02:00:01 12211 C NaT
0 2020-01-01 02:00:01 12345 C NaT
7 2020-01-01 02:10:35 12345 C 0 days 00:10:34
2 2020-01-01 05:00:03 12345 C 0 days 02:49:28
8 2020-01-01 02:00:01 13333 D NaT
4 2020-01-01 03:00:09 13333 D 0 days 01:00:08
10 2020-01-01 02:00:50 13333 E NaT
13 2020-01-01 02:11:50 13333 E 0 days 00:11:00
########## step2 ##########
DatetimeX ID Type diff tag
index
11 2020-01-01 02:00:01 12211 C NaT 1
0 2020-01-01 02:00:01 12345 C NaT 2
7 2020-01-01 02:10:35 12345 C 0 days 00:10:34 3
2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 4
8 2020-01-01 02:00:01 13333 D NaT 5
4 2020-01-01 03:00:09 13333 D 0 days 01:00:08 6
10 2020-01-01 02:00:50 13333 E NaT 7
13 2020-01-01 02:11:50 13333 E 0 days 00:11:00 8
########## result ##########
DatetimeX ID Type
11 2020-01-01 02:00:01 12211 C
0 2020-01-01 02:00:01 12345 C
7 2020-01-01 02:10:35 12345 C
2 2020-01-01 05:00:03 12345 C
8 2020-01-01 02:00:01 13333 D
4 2020-01-01 03:00:09 13333 D
10 2020-01-01 02:00:50 13333 E
13 2020-01-01 02:11:50 13333 E
来源:https://stackoverflow.com/questions/65821366/python-removing-rows-with-time-condition