How to join two dataframes for which column values are within a certain range?

清酒与你 2020-11-22 07:53

Given two dataframes df_1 and df_2, how can I join them so that each row of df_1 is matched to the row of df_2 whose start and end bracket the datetime column (timestamp) of df_1?
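For concreteness, here is a sketch of sample frames that are consistent with the tables and outputs quoted in the answers. The question itself does not show its data, so the exact construction is an assumption.

```python
import pandas as pd

# Assumed sample data, reconstructed from the outputs shown in the answers.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:36",
        "2016-05-14 10:54:39",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})

# Each row of df_2 is a closed interval [start, end] labelled with an event.
df_2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:38",
    ]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:37",
        "2016-05-14 10:54:42",
    ]),
    "event": ["E1", "E2", "E3"],
})
```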

5 Answers
  • 2020-11-22 08:35

    This method assumes the timestamp, start and end columns hold pandas Timestamp objects.

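    import numpy as np

    # df2:
    #    start                end                  event
    # 0  2016-05-14 10:54:31  2016-05-14 10:54:33  E1
    # 1  2016-05-14 10:54:34  2016-05-14 10:54:37  E2
    # 2  2016-05-14 10:54:38  2016-05-14 10:54:42  E3

    event_num = len(df2.event)

    def get_event(t):
        # Boolean mask of the intervals containing t, dotted with 0..n-1,
        # gives the row position of the single matching interval.
        # Caveat: if no interval contains t, the result is 0 and E1 is returned.
        event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
        return df2.event[event_idx]

    df1["event"] = df1.timestamp.transform(get_event)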
    

    Explanation of get_event

    For each timestamp in df1, say t0 = 2016-05-14 10:54:33,

    (t0 >= df2.start) & (t0 <= df2.end) contains exactly one True, provided the intervals do not overlap (see Example 1). Taking the dot product of this mask with np.arange(event_num) then gives the index of the event that t0 belongs to.

    Examples:

    Example 1

        t0 >= df2.start    t0 <= df2.end     After &     np.arange(3)    
    0     True                True         ->  T              0        event_idx
    1    False                True         ->  F              1     ->     0
    2    False                True         ->  F              2
    

    Example 2, with t2 = 2016-05-14 10:54:35

        t2 >= df2.start    t2 <= df2.end     After &     np.arange(3)    
    0     True                False        ->  F              0        event_idx
    1     True                True         ->  T              1     ->     1
    2    False                True         ->  F              2
    

    Finally, transform applies get_event to every timestamp in df1, mapping each one to its event.
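The same mask-and-dot idea can also be applied to all timestamps at once with NumPy broadcasting, rather than one timestamp per call. The following is a minimal sketch, not the answer's own code; it assumes non-overlapping intervals, and the third timestamp (10:54:50) is made up to show a value that falls in no interval.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})
# Hypothetical input; the last timestamp lies outside every interval.
df1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:50"]),
})

t = df1["timestamp"].to_numpy()[:, None]    # shape (n, 1)
start = df2["start"].to_numpy()             # shape (m,)
end = df2["end"].to_numpy()                 # shape (m,)

# mask[i, j] is True when timestamp i lies inside interval j.
mask = (t >= start) & (t <= end)            # shape (n, m)

# Same dot-product trick as get_event, applied to every row at once.
event_idx = mask.dot(np.arange(len(df2)))

events = df2["event"].to_numpy()[event_idx].astype(object)
# A row with no True in its mask also yields index 0, so blank it explicitly.
events[~mask.any(axis=1)] = None
df1["event"] = events
```

The mask has one cell per (timestamp, interval) pair, so memory grows with the product of the two frame lengths; for large inputs an IntervalIndex or merge_asof approach is likely the better fit.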

  • 2020-11-22 08:41

    You can use the pandasql module, which lets you query DataFrames with SQL:

    import pandasql as ps
    
    sqlcode = '''
    select df_1.timestamp
    ,df_1.A
    ,df_1.B
    ,df_2.event
    from df_1 
    inner join df_2 
    on df_1.timestamp between df_2.start and df_2.end
    '''
    
    newdf = ps.sqldf(sqlcode,locals())
    
  • 2020-11-22 08:46

    Option 1

    idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
    df_2.index=idx
    df_1['event']=df_2.loc[df_1.timestamp,'event'].values
    

    Option 2

    df_2['timestamp'] = df_2['end']
    pd.merge_asof(df_1, df_2[['timestamp', 'event']], on='timestamp', direction='forward', allow_exact_matches=True)
    Out[405]: 
                timestamp         A         B event
    0 2016-05-14 10:54:33  0.020228  0.026572    E1
    1 2016-05-14 10:54:34  0.057780  0.175499    E2
    2 2016-05-14 10:54:35  0.098808  0.620986    E2
    3 2016-05-14 10:54:36  0.158789  1.014819    E2
    4 2016-05-14 10:54:39  0.038129  2.384590    E3
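One thing to watch with Option 2: a forward merge_asof on end only finds the next interval end, so a timestamp that lies before an interval's start (in a gap, or before the first interval) is still assigned that interval's event. A possible guard, sketched below, carries start through the merge and blanks the rows where the timestamp is not actually inside the interval. The 10:54:30 timestamp is a made-up value used to show the effect.

```python
import pandas as pd

df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})
# Hypothetical input; 10:54:30 precedes the first interval.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:30", "2016-05-14 10:54:33", "2016-05-14 10:54:35"]),
})

# Use 'end' as the merge key, but keep 'start' for the containment check.
# Both frames must be sorted by the key, as merge_asof requires.
right = df_2.assign(timestamp=df_2["end"])[["timestamp", "start", "event"]]
merged = pd.merge_asof(df_1, right, on="timestamp", direction="forward")

# Keep the event only where the timestamp is at or after the interval start.
merged["event"] = merged["event"].where(merged["timestamp"] >= merged["start"])
merged = merged.drop(columns="start")
```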
    
  • 2020-11-22 08:51

    One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetime values have a Timestamp dtype).

    df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
    df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
    

    Output :

                timestamp         A         B event
    0 2016-05-14 10:54:33  0.020228  0.026572    E1
    1 2016-05-14 10:54:34  0.057780  0.175499    E2
    2 2016-05-14 10:54:35  0.098808  0.620986    E2
    3 2016-05-14 10:54:36  0.158789  1.014819    E2
    4 2016-05-14 10:54:39  0.038129  2.384590    E3
    
  • 2020-11-22 08:54

    First use IntervalIndex to build a reference index from the intervals of interest, then use get_indexer to find, for each timestamp, the position of the matching row in the dataframe that holds the events.

    idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
    event = df_2['event'].iloc[idx.get_indexer(df_1.timestamp)]
    
    event
    0    E1
    1    E2
    1    E2
    1    E2
    2    E3
    Name: event, dtype: object
    
    df_1['event'] = event.to_numpy()
    df_1
                timestamp         A         B event
    0 2016-05-14 10:54:33  0.020228  0.026572    E1
    1 2016-05-14 10:54:34  0.057780  0.175499    E2
    2 2016-05-14 10:54:35  0.098808  0.620986    E2
    3 2016-05-14 10:54:36  0.158789  1.014819    E2
    4 2016-05-14 10:54:39  0.038129  2.384590    E3
    

    Reference: A question on IntervalIndex.get_indexer.
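A caveat for the get_indexer approach: it returns -1 for a timestamp that no interval contains, and iloc[-1] would then silently pick the last row. The sketch below masks those positions; it assumes non-overlapping intervals, and the 10:54:50 timestamp is a made-up value that matches nothing.

```python
import pandas as pd

df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})
# Hypothetical input; the last timestamp lies outside every interval.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:39", "2016-05-14 10:54:50"]),
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")

# Position of the containing interval for each timestamp, or -1 if none.
pos = idx.get_indexer(df_1["timestamp"])

# iloc[-1] would return the last row, so blank the unmatched positions afterwards.
matched = df_2["event"].iloc[pos].to_numpy()
df_1["event"] = pd.Series(matched, index=df_1.index).where(pos != -1)
```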
