Python removing rows with time condition

Posted by 别来无恙 on 2021-01-24 07:06:10

Question


I have two DataFrames, each with a unique identifier, a Type, and a datetime column, in a format such as

"2020-01-01 00:00:01" (datetime) and "12345" (unique identifier), plus Type

First question, DF1:

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 02:00:03 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 05:03:05 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:09 12345 C
   2020-01-01 02:01:35 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:05:35 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:09:50 13333 E
   2020-01-01 02:11:50 13333 E

Starting from each ID's first timestamp with the same "Type", I would like to remove the rows that fall within the following 10 minutes, as shown:

   DatetimeX            ID    Type
   2020-01-01 02:00:01 12345 C
   2020-01-01 05:00:03 12345 C
   2020-01-01 02:10:35 12345 C
   2020-01-01 03:00:09 13333 D
   2020-01-01 02:00:01 13333 D
   2020-01-01 02:00:50 13333 E
   2020-01-01 02:00:01 12211 C
   2020-01-01 02:11:50 13333 E

I've tried to explore timerange/daterange but could not find a similar concept. I would appreciate pointers on what to look into rather than a full solution. I have not touched Python for a few years and was not very familiar with it before. Thank you.

Updated with additional data rows for a more accurate example.


Answer 1:


IIUC you should try groupby:

>>> df.groupby((df.Type != df.Type.shift()).cumsum(), as_index=False).first()
            DatetimeX     ID Type
0 2020-01-01 02:00:01  12345    C
1 2020-01-01 02:00:01  13333    D
2 2020-01-01 02:00:50  13333    E
3 2020-01-01 02:00:01  12211    C
>>> 

This groups consecutive rows that share the same Type value and keeps the first row of each group.
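
To make the grouping key concrete, here is a minimal sketch on a made-up Series (the values and the names types and run_id are illustrative, not from the question). It shows that the key labels runs of adjacent equal values, so the same Type appearing in two separate runs ends up in two separate groups:

```python
import pandas as pd

# Hypothetical toy data: two runs of 'C' separated by a 'D'
types = pd.Series(list('CCDCCE'))

# A new run starts wherever the value differs from the previous row
run_id = (types != types.shift()).cumsum()
print(run_id.tolist())

# first() then returns one value per run, not one per distinct Type
print(types.groupby(run_id).first().tolist())
```

Note that the key looks only at Type and at row order; it does not use ID or DatetimeX.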




Answer 2:


Based on your statement that you want to start from each ID's first timestamp with the same "Type" and remove the rows within 10 minutes, I believe you can use groupby().transform() to identify the first timestamps, then use boolean masking:

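import pandas as pd

# first timestamp of each (ID, Type) group, in the current row order;
# transform('min') also works and does not depend on row order
first_timestamps = df.groupby(['ID','Type'])['DatetimeX'].transform('first')

# True for rows less than 10 minutes after their group's first timestamp
mask = df['DatetimeX'] - first_timestamps < pd.Timedelta('10Min')

df[mask]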

However, since all the timestamps in your sample data are within 10 minutes of each other, this wouldn't remove anything.

Instead, if we change 10Min to 1S in the mask line above, we get the expected output:

            DatetimeX     ID Type
0 2020-01-01 02:00:01  12345    C
4 2020-01-01 02:00:01  13333    D
6 2020-01-01 02:00:50  13333    E
7 2020-01-01 02:00:01  12211    C
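
The choice between transform('first') and transform('min') matters when the rows are not sorted by time. A minimal sketch on made-up data (the frame demo and its values are hypothetical) where the first row of the group is not the earliest one:

```python
import pandas as pd

# Hypothetical unsorted data: the earliest timestamp is in the second row
demo = pd.DataFrame({
    'ID': [1, 1, 1],
    'DatetimeX': pd.to_datetime(['2020-01-01 02:08:00',
                                 '2020-01-01 02:00:00',
                                 '2020-01-01 02:12:00']),
})
ten_min = pd.Timedelta(minutes=10)

# 'first' anchors on the first row in the current order (02:08:00)
first = demo.groupby('ID')['DatetimeX'].transform('first')
mask_first = demo['DatetimeX'] - first < ten_min

# 'min' anchors on the earliest timestamp (02:00:00)
earliest = demo.groupby('ID')['DatetimeX'].transform('min')
mask_min = demo['DatetimeX'] - earliest < ten_min

print(mask_first.tolist())
print(mask_min.tolist())
```

With 'first' all three rows count as inside the window, while with 'min' the 02:12:00 row falls outside it, so sorting by DatetimeX first or using 'min' is the safer option.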



Answer 3:


With sample input data added and the process simplified:

import pandas as pd

Timestamp = pd.to_datetime
data = [{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:03'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 05:00:03'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 05:03:05'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 03:00:09'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:09'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:01:35'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:10:35'), 'ID': 12345, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:05:35'), 'ID': 13333, 'Type': 'D'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:50'), 'ID': 13333, 'Type': 'E'},
 {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12211, 'Type': 'C'},
 {'DatetimeX': Timestamp('2020-01-01 02:09:50'), 'ID': 13333, 'Type': 'E'},
 {'DatetimeX': Timestamp('2020-01-01 02:11:50'), 'ID': 13333, 'Type': 'E'}]
df1 = pd.DataFrame(data)


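col_raw = df1.columns
ten_min = pd.Timedelta(minutes=10)
while True:
    df1 = df1.sort_values(['ID', 'Type', 'DatetimeX'])
    # gap to the previous row of the same (ID, Type); NaT compares as False
    df1['diff1_lt10min'] = df1.groupby(['ID', 'Type'])['DatetimeX'].diff() < ten_min
    # a new tag_group starts at every row not within 10 min of its predecessor
    df1['tag_group'] = (~df1['diff1_lt10min']).cumsum()
    # stop once every remaining row is its own group
    if df1.duplicated('tag_group').sum()==0:
        break
    df1 = df1.merge((df1.groupby('tag_group')['DatetimeX'].first()
               .reset_index()
               .rename(columns={'DatetimeX':'DatetimeX_1st'})),
              on='tag_group')
    # gap to the first row of the tag_group
    df1['diff2_lt10min'] = (df1.DatetimeX - df1.DatetimeX_1st) < ten_min
    # drop rows that are close to both their predecessor and the group's first row
    cond = df1['diff1_lt10min'] & df1['diff2_lt10min']
    df1 = df1.loc[~cond, col_raw]
df1 = df1[col_raw]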
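
When testing a gap against the 10-minute threshold, it matters how the timedelta is read. Series.dt.seconds returns only the seconds component of each value and ignores whole days, so a gap of a day or more can look short. A minimal sketch with made-up gaps (the name gaps is illustrative):

```python
import pandas as pd

# Hypothetical gaps: 5 minutes, and 1 day plus 5 minutes
gaps = pd.Series(pd.to_timedelta(['0 days 00:05:00', '1 days 00:05:00']))

# .dt.seconds drops the day part, so both gaps report 300 seconds
print(gaps.dt.seconds.tolist())

# .dt.total_seconds() and a direct Timedelta comparison use the full duration
print(gaps.dt.total_seconds().tolist())
print((gaps < pd.Timedelta(minutes=10)).tolist())
```

The sample data in the question spans a single day, so this does not affect the output shown here, but it would on data that crosses midnight by more than a day.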

Step-by-step version, printing each intermediate result:

import numpy as np

# repeat until every remaining row starts its own group
col_raw = df1.columns
df4 = df1.copy()
n_round = 1
while True:
    print('#'*20, f'round {n_round}', '#'*20)
    # step 1: sort the values, then group by ['Type', 'ID'] and compute the DatetimeX time diff
    # note: this row-to-row diff is not yet the gap we actually want
    df = df4[col_raw].copy()
    df.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True)
    df['diff'] = df.groupby(['Type', 'ID'])['DatetimeX'].diff()
    print('#'*10, 'step1', '#'*10)
    print(df)

    # step 2: create a tag column that starts a new group at each gap of 10 min or more in 'diff'
    cond = False 
    cond |= df['diff'] >= pd.Timedelta(minutes=10)
    cond |= df['diff'].isnull()
    df['tag'] = np.where(cond, 1, 0)
    df['tag'] = df['tag'].cumsum()
    print('#'*10, 'step2', '#'*10)
    print(df)

    # step 3: use 'tag' to decide whether to stop the while loop
    # stop once every tag is unique
    break_sign = df.tag.duplicated().sum()
    if break_sign == 0:
        break
    print('#'*10, 'step3', '#'*10)
    print(break_sign)
    
    # step 4:
        # create a 'DatetimeX_1st' with the 'tag' group's first DatetimeX
        # create a 'diff2' = 'DatetimeX' - 'DatetimeX_1st'
    df2 = df.reset_index().set_index('tag')
    df2['DatetimeX_1st'] = df.groupby('tag').first()['DatetimeX']
    df2['diff2'] = df2['DatetimeX'] - df2['DatetimeX_1st']
    print('#'*10, 'step4', '#'*10)
    print(df2)
    
    # step 5:
        # drop the records that are truly within the 10 min window
        # both 'diff' and 'diff2' must be < 10 min
    cond = (df2['diff2'] < pd.Timedelta(minutes=10)) & (df2['diff'] < pd.Timedelta(minutes=10))
    df3 = df2[~cond].copy()
    print('#'*10, 'step5', '#'*10)
    print(df3)
    
    
    # step 6:
        # reset index
    cols = 'tag DatetimeX   ID  Type'.split()
    df4 = df3.reset_index().set_index('index').sort_index()[cols]
    print('#'*10, 'step6', '#'*10)
    print(df4)
    
    n_round += 1
    print()
    
# get result
result = df[['DatetimeX', 'ID', 'Type']].copy()
result.index.name = None
print()
print('#'*10, 'result', '#'*10)
print(result)

Output:

#################### round 1 ####################
########## step1 ##########
             DatetimeX     ID Type            diff
11 2020-01-01 02:00:01  12211    C             NaT
0  2020-01-01 02:00:01  12345    C             NaT
1  2020-01-01 02:00:03  12345    C 0 days 00:00:02
5  2020-01-01 02:00:09  12345    C 0 days 00:00:06
6  2020-01-01 02:01:35  12345    C 0 days 00:01:26
7  2020-01-01 02:10:35  12345    C 0 days 00:09:00
2  2020-01-01 05:00:03  12345    C 0 days 02:49:28
3  2020-01-01 05:03:05  12345    C 0 days 00:03:02
8  2020-01-01 02:00:01  13333    D             NaT
9  2020-01-01 02:05:35  13333    D 0 days 00:05:34
4  2020-01-01 03:00:09  13333    D 0 days 00:54:34
10 2020-01-01 02:00:50  13333    E             NaT
12 2020-01-01 02:09:50  13333    E 0 days 00:09:00
13 2020-01-01 02:11:50  13333    E 0 days 00:02:00
########## step2 ##########
             DatetimeX     ID Type            diff  tag
11 2020-01-01 02:00:01  12211    C             NaT    1
0  2020-01-01 02:00:01  12345    C             NaT    2
1  2020-01-01 02:00:03  12345    C 0 days 00:00:02    2
5  2020-01-01 02:00:09  12345    C 0 days 00:00:06    2
6  2020-01-01 02:01:35  12345    C 0 days 00:01:26    2
7  2020-01-01 02:10:35  12345    C 0 days 00:09:00    2
2  2020-01-01 05:00:03  12345    C 0 days 02:49:28    3
3  2020-01-01 05:03:05  12345    C 0 days 00:03:02    3
8  2020-01-01 02:00:01  13333    D             NaT    4
9  2020-01-01 02:05:35  13333    D 0 days 00:05:34    4
4  2020-01-01 03:00:09  13333    D 0 days 00:54:34    5
10 2020-01-01 02:00:50  13333    E             NaT    6
12 2020-01-01 02:09:50  13333    E 0 days 00:09:00    6
13 2020-01-01 02:11:50  13333    E 0 days 00:02:00    6
########## step3 ##########
8
########## step4 ##########
     index           DatetimeX     ID Type            diff  \
tag                                                          
1       11 2020-01-01 02:00:01  12211    C             NaT   
2        0 2020-01-01 02:00:01  12345    C             NaT   
2        1 2020-01-01 02:00:03  12345    C 0 days 00:00:02   
2        5 2020-01-01 02:00:09  12345    C 0 days 00:00:06   
2        6 2020-01-01 02:01:35  12345    C 0 days 00:01:26   
2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
3        3 2020-01-01 05:03:05  12345    C 0 days 00:03:02   
4        8 2020-01-01 02:00:01  13333    D             NaT   
4        9 2020-01-01 02:05:35  13333    D 0 days 00:05:34   
5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
6       10 2020-01-01 02:00:50  13333    E             NaT   
6       12 2020-01-01 02:09:50  13333    E 0 days 00:09:00   
6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   

          DatetimeX_1st           diff2  
tag                                      
1   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:02  
2   2020-01-01 02:00:01 0 days 00:00:08  
2   2020-01-01 02:00:01 0 days 00:01:34  
2   2020-01-01 02:00:01 0 days 00:10:34  
3   2020-01-01 05:00:03 0 days 00:00:00  
3   2020-01-01 05:00:03 0 days 00:03:02  
4   2020-01-01 02:00:01 0 days 00:00:00  
4   2020-01-01 02:00:01 0 days 00:05:34  
5   2020-01-01 03:00:09 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:09:00  
6   2020-01-01 02:00:50 0 days 00:11:00  
########## step5 ##########
     index           DatetimeX     ID Type            diff  \
tag                                                          
1       11 2020-01-01 02:00:01  12211    C             NaT   
2        0 2020-01-01 02:00:01  12345    C             NaT   
2        7 2020-01-01 02:10:35  12345    C 0 days 00:09:00   
3        2 2020-01-01 05:00:03  12345    C 0 days 02:49:28   
4        8 2020-01-01 02:00:01  13333    D             NaT   
5        4 2020-01-01 03:00:09  13333    D 0 days 00:54:34   
6       10 2020-01-01 02:00:50  13333    E             NaT   
6       13 2020-01-01 02:11:50  13333    E 0 days 00:02:00   

          DatetimeX_1st           diff2  
tag                                      
1   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:00:00  
2   2020-01-01 02:00:01 0 days 00:10:34  
3   2020-01-01 05:00:03 0 days 00:00:00  
4   2020-01-01 02:00:01 0 days 00:00:00  
5   2020-01-01 03:00:09 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:00:00  
6   2020-01-01 02:00:50 0 days 00:11:00  
########## step6 ##########
       tag           DatetimeX     ID Type
index                                     
0        2 2020-01-01 02:00:01  12345    C
2        3 2020-01-01 05:00:03  12345    C
4        5 2020-01-01 03:00:09  13333    D
7        2 2020-01-01 02:10:35  12345    C
8        4 2020-01-01 02:00:01  13333    D
10       6 2020-01-01 02:00:50  13333    E
11       1 2020-01-01 02:00:01  12211    C
13       6 2020-01-01 02:11:50  13333    E

#################### round 2 ####################
########## step1 ##########
                DatetimeX     ID Type            diff
index                                                
11    2020-01-01 02:00:01  12211    C             NaT
0     2020-01-01 02:00:01  12345    C             NaT
7     2020-01-01 02:10:35  12345    C 0 days 00:10:34
2     2020-01-01 05:00:03  12345    C 0 days 02:49:28
8     2020-01-01 02:00:01  13333    D             NaT
4     2020-01-01 03:00:09  13333    D 0 days 01:00:08
10    2020-01-01 02:00:50  13333    E             NaT
13    2020-01-01 02:11:50  13333    E 0 days 00:11:00
########## step2 ##########
                DatetimeX     ID Type            diff  tag
index                                                     
11    2020-01-01 02:00:01  12211    C             NaT    1
0     2020-01-01 02:00:01  12345    C             NaT    2
7     2020-01-01 02:10:35  12345    C 0 days 00:10:34    3
2     2020-01-01 05:00:03  12345    C 0 days 02:49:28    4
8     2020-01-01 02:00:01  13333    D             NaT    5
4     2020-01-01 03:00:09  13333    D 0 days 01:00:08    6
10    2020-01-01 02:00:50  13333    E             NaT    7
13    2020-01-01 02:11:50  13333    E 0 days 00:11:00    8

########## result ##########
             DatetimeX     ID Type
11 2020-01-01 02:00:01  12211    C
0  2020-01-01 02:00:01  12345    C
7  2020-01-01 02:10:35  12345    C
2  2020-01-01 05:00:03  12345    C
8  2020-01-01 02:00:01  13333    D
4  2020-01-01 03:00:09  13333    D
10 2020-01-01 02:00:50  13333    E
13 2020-01-01 02:11:50  13333    E
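
As a cross-check on the result above, here is a minimal sketch of a different technique, not the answer's loop: a single pass over each (ID, Type) group that keeps a row only when it is at least 10 minutes after the last kept row. The names sample, window, keep_index and kept are illustrative, and the sketch assumes that this re-anchoring rule is what the question intends.

```python
import pandas as pd

# Sample data from the question, in its original row order
sample = pd.DataFrame({
    'DatetimeX': pd.to_datetime([
        '2020-01-01 02:00:01', '2020-01-01 02:00:03', '2020-01-01 05:00:03',
        '2020-01-01 05:03:05', '2020-01-01 03:00:09', '2020-01-01 02:00:09',
        '2020-01-01 02:01:35', '2020-01-01 02:10:35', '2020-01-01 02:00:01',
        '2020-01-01 02:05:35', '2020-01-01 02:00:50', '2020-01-01 02:00:01',
        '2020-01-01 02:09:50', '2020-01-01 02:11:50']),
    'ID': [12345, 12345, 12345, 12345, 13333, 12345, 12345,
           12345, 13333, 13333, 13333, 12211, 13333, 13333],
    'Type': list('CCCCDCCCDDECEE'),
})

window = pd.Timedelta(minutes=10)
keep_index = []
for _, grp in sample.groupby(['ID', 'Type']):
    anchor = None  # timestamp of the last kept row in this group
    for idx, ts in grp['DatetimeX'].sort_values().items():
        # keep the first row, and any row at least one window after the anchor
        if anchor is None or ts - anchor >= window:
            anchor = ts
            keep_index.append(idx)

# restore the original row order
kept = sample.loc[sorted(keep_index)]
print(kept)
```

On the sample data this keeps rows 0, 2, 4, 7, 8, 10, 11 and 13, the same rows as the result above. The explicit Python loop is slower than vectorized operations on large frames, but it states the rule directly.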


Source: https://stackoverflow.com/questions/65821366/python-removing-rows-with-time-condition
