pd.merge_asof() based on Time-Difference not merging all values - Pandas

不知归路 2021-01-28 17:05

I have two dataframes, one with news and the other with stock prices. Both dataframes have a "Date" column. I want to merge them, allowing a gap of up to 5 days between the dates.

Lets say my new

2 Answers
  • 2021-01-28 17:49

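    You can swap the left and right dataframes. merge_asof is a left join that matches at most one right row to each left row, so putting the news frame (the one with 'News_Dates') on the left keeps every news item: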

    df = pd.merge_asof(
            df1,
            df2,
            left_on='News_Dates',
            right_on='Dates',
            tolerance=pd.Timedelta('5D'),
            direction='nearest'
        )
    
    df = df[['Dates', 'News_Dates', 'News', 'Price']]
    print(df)
    
            Dates News_Dates                                               News Price
    0 2018-10-04 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t... 120
    1 2018-10-04 2018-09-30                           ABC Corp. suffers a loss 120
    2 2018-10-04 2018-10-01                            ABC Corp to Sell stakes 120
    3 2018-12-24 2018-12-20       We are going to comeback strong said ABC CEO 131
    4 2018-12-24 2018-12-22            Shares are down massively for ABC Corp. 131
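
To see why the swap matters, here is a minimal sketch built on the sample data from this thread. The names `news`, `prices`, `by_price`, and `by_news` are my own, and I store Price as an integer for brevity, so treat it as an illustration rather than the asker's exact setup. Both frames are already sorted by their date column, which `merge_asof` requires.

```python
import pandas as pd

# Sample data from the thread; variable names are illustrative assumptions.
news = pd.DataFrame({
    "News_Dates": pd.to_datetime(
        ["2018-09-29", "2018-09-30", "2018-10-01", "2018-12-20", "2018-12-22"]
    ),
    "News": [
        "Huge blow to ABC Corp. as they lost the 2012 tax case",
        "ABC Corp. suffers a loss",
        "ABC Corp to Sell stakes",
        "We are going to comeback strong said ABC CEO",
        "Shares are down massively for ABC Corp.",
    ],
})
prices = pd.DataFrame({
    "Dates": pd.to_datetime(["2018-10-04", "2018-12-24"]),
    "Price": [120, 131],
})

# Prices on the left: one output row per price, so most news rows are dropped.
by_price = pd.merge_asof(
    prices, news,
    left_on="Dates", right_on="News_Dates",
    tolerance=pd.Timedelta("5D"), direction="nearest",
)

# News on the left: one output row per news item, each with its nearest price.
by_news = pd.merge_asof(
    news, prices,
    left_on="News_Dates", right_on="Dates",
    tolerance=pd.Timedelta("5D"), direction="nearest",
)

print(len(by_price), len(by_news))
```

With the prices on the left, only the single nearest news item survives for each price, which is probably the "not merging all values" symptom in the title.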
    
  • 2021-01-28 17:55

    Here is my solution using NumPy:

    import numpy as np
    import pandas as pd

    df_n = pd.DataFrame(
        [('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'),
         ('2018-09-30', 'ABC Corp. suffers a loss'),
         ('2018-10-01', 'ABC Corp to Sell stakes'),
         ('2018-12-20', 'We are going to comeback strong said ABC CEO'),
         ('2018-12-22', 'Shares are down massively for ABC Corp.')],
        columns=('News_Dates', 'News'))
    df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))
    
    df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
    df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
    
    n_dates = df_n["News_Dates"].values
    p_dates = df1_zscore[["Dates"]].values
    
    ## subtract each pair of n_dates and p_dates to build a (prices x news) matrix of day differences
    mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')
    
    ## boolean matrix: True where the difference is between 0 and 5 days,
    ## to be used as an index into the original arrays
    comparison = (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))
    
    ## flat cell numbers (0 to matrix size - 1) of the cells that meet the condition
    ind = np.arange(len(n_dates)*len(p_dates))[comparison.ravel()]
    
    
    ## calculate row and column index from cell number to index the df
    pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True), 
               df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)
    

    Result

           Dates Price News_Dates                                               News
    0 2018-10-04   120 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t...
    1 2018-10-04   120 2018-09-30                           ABC Corp. suffers a loss
    2 2018-10-04   120 2018-10-01                            ABC Corp to Sell stakes
    3 2018-12-24   131 2018-12-20       We are going to comeback strong said ABC CEO
    4 2018-12-24   131 2018-12-22            Shares are down massively for ABC Corp.
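
The flat-index arithmetic above (`ind // len(n_dates)` and `ind % len(n_dates)`) can also be written with `np.nonzero`, which returns the row and column indices of the True cells directly. The following is a sketch of that variant on the same sample data, not the answerer's original code; the names `diff`, `mask`, `price_idx`, and `news_idx` are my own.

```python
import numpy as np
import pandas as pd

df_n = pd.DataFrame({
    "News_Dates": pd.to_datetime(
        ["2018-09-29", "2018-09-30", "2018-10-01", "2018-12-20", "2018-12-22"]
    ),
    "News": [
        "Huge blow to ABC Corp. as they lost the 2012 tax case",
        "ABC Corp. suffers a loss",
        "ABC Corp to Sell stakes",
        "We are going to comeback strong said ABC CEO",
        "Shares are down massively for ABC Corp.",
    ],
})
df1_zscore = pd.DataFrame({
    "Dates": pd.to_datetime(["2018-10-04", "2018-12-24"]),
    "Price": ["120", "131"],
})

# A (2, 1) column of price dates minus a (5,) row of news dates
# broadcasts to a (2, 5) matrix of differences.
diff = df1_zscore["Dates"].to_numpy()[:, None] - df_n["News_Dates"].to_numpy()

# True where the news item is 0 to 5 days before the price date.
mask = (diff >= np.timedelta64(0, "D")) & (diff <= np.timedelta64(5, "D"))

# Row indices point into the price frame, column indices into the news frame.
price_idx, news_idx = np.nonzero(mask)

result = pd.concat(
    [df1_zscore.iloc[price_idx].reset_index(drop=True),
     df_n.iloc[news_idx].reset_index(drop=True)],
    axis=1,
)
print(result)
```

Like the original, this builds the full prices x news matrix, so memory grows with the product of the two frame lengths.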
    