pandas: merging on timestamps that do not match exactly

礼貌的吻别 2020-11-30 04:12

What methods are available to merge columns which have timestamps that do not exactly match?

DF1:

date    start_time  employee_id session_id
01/01/20         


        
3 Answers
  • 2020-11-30 04:53

    Consider the following mini-version of your problem:

    from io import StringIO
    from pandas import read_csv, to_datetime
    
    # how close do sessions have to be to be considered equal? (in minutes)
    threshold = 5
    
    # datetime column (combination of date + start_time)
    # note: combining columns via a nested list in parse_dates is deprecated since pandas 2.2
    dtc = [['date', 'start_time']]
    
    # index column (above combination)
    ixc = 'date_start_time'
    
    df1 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)
    
    df2 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)
    

    which gives

    >>> df1
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:03:00      7261824   871631183
    2 2016-01-01 11:01:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    >>> df2
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:10:00      7261824   871631185
    

    You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since each pair is less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions (they are 9 minutes apart).

    Solution 1: Interval Matching

    This is essentially what you are suggesting in your edit. You want to map the timestamps in both tables to 5-minute intervals, each centered on a timestamp rounded to the nearest 5 minutes.

    Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:

    import numpy as np
    
    # threshold (interval width) in nanoseconds
    threshold_ns = threshold * 60 * 1e9
    
    # compute "interval" to which each session belongs
    df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    
    # join
    cols = ['interval', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])
    

    which prints

                 interval  employee_id  session_id
    0 2016-01-01 02:05:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:00:00      7261824   871631184
    3 2016-01-01 14:00:00      7261824   871631185
    4 2016-01-01 11:05:00      7261824   871631184
    5 2016-01-01 14:10:00      7261824   871631185
    

    Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they fall on opposite sides of an interval boundary (11:02:30), so they are rounded to 11:00 and 11:05 respectively.
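
    The boundary effect can be reproduced in isolation. The following is a minimal sketch, assuming a pandas version that accepts the "5min" frequency alias; the variable names are illustrative and not part of the answer above.

```python
import pandas as pd

# The two sessions from the example that straddle a bin boundary.
t1 = pd.Timestamp("2016-01-01 11:01:00")  # df1[2]
t2 = pd.Timestamp("2016-01-01 11:04:00")  # df2[2]

gap = t2 - t1            # actual distance between the two sessions
bin1 = t1.round("5min")  # interval midpoint assigned to t1
bin2 = t2.round("5min")  # interval midpoint assigned to t2

print(gap)         # 0 days 00:03:00
print(bin1, bin2)  # 2016-01-01 11:00:00 2016-01-01 11:05:00
```

    The gap is below the 5-minute threshold, yet the two timestamps receive different merge keys, which is why the rounded merge misses this pair.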

    Solution 2: One-to-one matching

    Here is another approach which depends on the condition that sessions in df1 have either zero or one duplicates in df2.

    We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.

    from datetime import timedelta
    
    # get closest match from "df2" to row from "df1" (as long as it's below the threshold)
    def closest(row):
        matches = df2.loc[(df2.employee_id == row.employee_id) &
                          (df2.session_id == row.session_id)]
    
        # use the absolute difference so that earlier and later timestamps are treated alike
        deltas = (matches.date_start_time - row.date_start_time).abs()
        deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    
        try:
            return matches.loc[deltas.idxmin()]
        except ValueError:  # no items
            return row
    
    # replace timestamps in "df1" with closest timestamps in "df2"
    df1 = df1.apply(closest, axis=1)
    
    # join
    cols = ['date_start_time', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])
    

    which prints

          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    4 2016-01-01 14:10:00      7261824   871631185
    

    This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but this will still take a long time on large datasets.
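
    One possible way to avoid the row-by-row search is to pair rows with a single merge on the ID columns and then filter by the time gap. This is only a sketch under the assumption that few rows share the same (employee_id, session_id) pair, since the intermediate table holds every candidate pair; the names keys, pairs and close are illustrative.

```python
import pandas as pd

ids = [871631182, 871631183, 871631184, 871631185]
df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})

keys = ["employee_id", "session_id"]
threshold = pd.Timedelta(minutes=5)

# Pair every df1 row with every df2 row that shares the same IDs.
pairs = df1.merge(df2, on=keys, suffixes=("_df1", "_df2"))

# Absolute time gap for each candidate pair.
pairs["gap"] = (pairs["date_start_time_df1"] - pairs["date_start_time_df2"]).abs()

# Keep pairs within the threshold, then only the closest df2 row per df1 row.
close = (pairs[pairs["gap"] <= threshold]
         .sort_values("gap")
         .drop_duplicates(subset=keys + ["date_start_time_df1"]))

print(close.sort_values("date_start_time_df1"))
```

    For the sample data, close contains the first three sessions; the 14:01 and 14:10 rows are 9 minutes apart and are filtered out.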

  • 2020-11-30 04:58

    I would suggest using the built-in pandas Series.dt.round method to round the timestamps in both dataframes to a common grid, for example to the nearest 5 minutes. The times will then always fall on values such as 01:00:00 and 01:05:00, so both dataframes share a time key on which to perform the merge.

    See the documentation and examples for pandas.Series.dt.round.
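
    As a rough illustration of this suggestion, the sketch below rounds both timestamp columns with Series.dt.round and merges on the rounded value. The sample data is borrowed from the first answer, and the column name interval and the suffixes are assumptions made for the example.

```python
import pandas as pd

ids = [871631182, 871631183, 871631184, 871631185]
df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})

# Round each timestamp to the nearest 5 minutes to obtain a shared merge key.
df1["interval"] = df1["date_start_time"].dt.round("5min")
df2["interval"] = df2["date_start_time"].dt.round("5min")

keys = ["interval", "employee_id", "session_id"]
merged = df1.merge(df2, on=keys, how="outer", suffixes=("_df1", "_df2"))

# Rows where both original timestamps are present were matched across frames.
both = merged["date_start_time_df1"].notna() & merged["date_start_time_df2"].notna()
print(merged)
print(both.sum())
```

    Rounding is quick and simple, but two timestamps that lie close together on opposite sides of a rounding boundary still end up with different keys, so some near matches can be missed.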

  • 2020-11-30 05:03

    I would try using this method in pandas:

    pandas.merge_asof()

    The parameters of interest for you would be direction, tolerance, left_on, and right_on.

    Building on @Igor's answer:

    import pandas as pd
    from pandas import read_csv
    from io import StringIO
    
    # datetime column (combination of date + start_time)
    dtc = [['date', 'start_time']]
    
    # index column (above combination)
    ixc = 'date_start_time'
    
    df1 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)
    
    df2 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)
    
    
    
    df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
    df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
    
    # converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
    df1.index = df1['date_start_time']
    df2.index = df2['date_start_time']
    # the magic happens below, check the direction and tolerance arguments
    tol = pd.Timedelta('5 minute')
    pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
    

    Output:

                           date_start_time_x  employee_id_x  session_id_x    date_start_time_y  employee_id_y  session_id_y
    date_start_time
    2016-01-01 02:03:00  2016-01-01 02:03:00        7261824     871631182  2016-01-01 02:03:00      7261824.0   871631182.0
    2016-01-01 06:03:00  2016-01-01 06:03:00        7261824     871631183  2016-01-01 06:05:00      7261824.0   871631183.0
    2016-01-01 11:01:00  2016-01-01 11:01:00        7261824     871631184  2016-01-01 11:04:00      7261824.0   871631184.0
    2016-01-01 14:01:00  2016-01-01 14:01:00        7261824     871631185                  NaT            NaN           NaN
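
    The call above matches on time alone. If employee_id and session_id must also agree, merge_asof accepts a by argument for exact-match columns. The following is a sketch of that variant, assuming both frames are sorted by the timestamp column; the helper column df2_start_time is introduced here only so that the matched df2 timestamp remains visible.

```python
import pandas as pd

ids = [871631182, 871631183, 871631184, 871631185]
df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": 7261824,
    "session_id": ids,
})

# Copy df2's timestamp so it survives the merge for inspection.
df2["df2_start_time"] = df2["date_start_time"]

matched = pd.merge_asof(
    df1.sort_values("date_start_time"),   # merge_asof requires sorted keys
    df2.sort_values("date_start_time"),
    on="date_start_time",                 # nearest-match key
    by=["employee_id", "session_id"],     # these must match exactly
    direction="nearest",
    tolerance=pd.Timedelta(minutes=5),
)
print(matched)
```

    With the sample data, the first three rows pick up the df2 timestamps 02:03, 06:05 and 11:04, while the 14:01 row gets NaT because the only candidate is 9 minutes away.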
    