Pandas: how to merge two dataframes on offset dates?

萌比男神i 2021-01-22 09:13

I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:

df1 (for each company I

2 Answers
  • 2021-01-22 09:19

    This is actually one of those rare questions where the algorithmic complexity might be significantly different for different solutions. You might want to consider this over the niftiness of 1-liner snippets.

    Algorithmically:

    • sort the larger of the dataframes according to the date

    • for each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe

    For dataframes with lengths m and n, respectively (m < n), the lookups should cost O(m log(n)) once the larger dataframe is sorted; the sort itself adds O(n log(n)).
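
    The two steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper name `rows_in_window`, the sample dates, and the explicit (start, end) windows are hypothetical and not taken from the question's data.

```python
import bisect
from datetime import date


def rows_in_window(sorted_dates, start, end):
    """Return (lo, hi) so that sorted_dates[lo:hi] lies strictly between start and end."""
    lo = bisect.bisect_right(sorted_dates, start)  # first index with date > start
    hi = bisect.bisect_left(sorted_dates, end)     # first index with date >= end
    return lo, hi


# Hypothetical event dates from the larger dataframe, sorted once: O(n log n).
event_dates = sorted([
    date(2014, 8, 15), date(2014, 1, 10), date(2014, 5, 20), date(2014, 11, 3),
])

# Hypothetical (start, end) windows derived from the smaller dataframe.
windows = [
    (date(2014, 3, 31), date(2014, 6, 30)),
    (date(2014, 6, 30), date(2014, 9, 30)),
]

# Each lookup is O(log n), so m windows cost O(m log n) after the sort.
matches = []
for start, end in windows:
    lo, hi = rows_in_window(event_dates, start, end)
    matches.append(event_dates[lo:hi])
```

    Because the slice is half-open, an event that falls exactly on a window boundary is excluded from that window.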

  • 2021-01-22 09:42

    This is my solution, based on the algorithm that Ami Tavory suggested in the other answer:

    import bisect
    import pandas as pd

    # add the date offsets that define each date range as extra columns
    df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
    df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)

    # find unique company names in both dfs
    unique_companies_df1 = df1.company.unique()
    unique_companies_df2 = df2.company.unique()

    # sort df1 by company and DATADATE, so we can iterate in a sensible order
    sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)

    # collect the merged pieces here; they are concatenated once at the end
    pieces = []

    # iterate through each company in df1, find
    # that company in sorted df2, then for each
    # DATADATE quarter of df1, bisect df2 in the
    # correct locations (i.e. start_time to end_time)
    for cmpny in unique_companies_df1:

        # if this company is in both dfs, take the rows associated with it
        if cmpny in unique_companies_df2:
            selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
            selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
            event_dates = selected_df2.EventDate.tolist()  # sorted list for bisect

            # iterate through each DATADATE quarter in df1
            for quarter in range(len(selected_df1)):
                # bisect_right so that dates on or before start_time are excluded
                lo = bisect.bisect_right(event_dates, selected_df1.start_time[quarter])
                # bisect_left so that dates on or after end_time are excluded
                hi = bisect.bisect_left(event_dates, selected_df1.end_time[quarter])
                # grab all rows whose EventDate falls within the range;
                # iloc is positional and half-open, matching the bisect indices
                df_right = selected_df2.iloc[lo:hi].copy()
                df_left = selected_df1.iloc[[quarter]]

                # if no EventDates fall within range, create a row with cmpny in the
                # 'company' column and a missing EventDate, so the merge keeps the quarter
                if len(df_right) == 0:
                    df_right.loc[0, 'company'] = cmpny

                # merge the df1 company quarter with all df2 rows within the date range
                pieces.append(pd.merge(df_left, df_right, how='inner', on='company'))

    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
    df3 = pd.concat(pieces) if pieces else pd.DataFrame()
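
    To make the window boundaries concrete, here is a standard-library sketch of what adding `pd.offsets.MonthEnd(n)` to a date does, as I understand pandas' documented roll-forward behaviour. The helper `month_end_offset` is hypothetical, handles only positive n, and is meant as an illustration, not as a replacement for the pandas offset.

```python
import calendar
from datetime import date


def month_end_offset(d, n):
    """Mimic d + pd.offsets.MonthEnd(n) for n >= 1 (assumption: positive n only)."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    # A date that is not already a month end spends one step
    # rolling forward to the end of its own month.
    steps = n if d.day == last_day else n - 1
    month_index = d.year * 12 + (d.month - 1) + steps
    year, month = divmod(month_index, 12)
    month += 1  # back to a 1-based month number
    return date(year, month, calendar.monthrange(year, month)[1])


quarter_end = date(2014, 3, 31)                     # hypothetical DATADATE
start_time = month_end_offset(quarter_end, 3)       # 2014-06-30
end_time = month_end_offset(quarter_end, 6)         # 2014-09-30
mid_month = month_end_offset(date(2014, 3, 15), 3)  # 2014-05-31
```

    For a DATADATE that is already a quarter end, the window therefore runs from the month end three months later to the month end six months later.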
    