Pandas Merge on Name and Closest Date

后端 未结 3 654
北恋
北恋 2020-12-08 12:00

I am trying to merge two dataframes on both name and the closest date (WRT the left hand dataframe). In my research I found one similar question here but it doesn\'t account

相关标签:
3条回答
  • 2020-12-08 12:26

    I'd also love to see the final solution you came up with to know how it shook out in the end.

    One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin to retrieve the date with the smallest time delta.

    For example:

    Setup

    #!/usr/bin/env python
    # -*- coding: utf-8 -*- 
    import numpy as np
    import pandas as pd
    from pandas.io.parsers import StringIO
    

    Data

    a = """timepoint,measure
    2014-01-01 00:00:00,78
    2014-01-02 00:00:00,29
    2014-01-03 00:00:00,5
    2014-01-04 00:00:00,73
    2014-01-05 00:00:00,40
    2014-01-06 00:00:00,45
    2014-01-07 00:00:00,48
    2014-01-08 00:00:00,2
    2014-01-09 00:00:00,96
    2014-01-10 00:00:00,82
    2014-01-11 00:00:00,61
    2014-01-12 00:00:00,68
    2014-01-13 00:00:00,8
    2014-01-14 00:00:00,94
    2014-01-15 00:00:00,16
    2014-01-16 00:00:00,31
    2014-01-17 00:00:00,10
    2014-01-18 00:00:00,34
    2014-01-19 00:00:00,27
    2014-01-20 00:00:00,58
    2014-01-21 00:00:00,90
    2014-01-22 00:00:00,41
    2014-01-23 00:00:00,97
    2014-01-24 00:00:00,7
    2014-01-25 00:00:00,86
    2014-01-26 00:00:00,62
    2014-01-27 00:00:00,91
    2014-01-28 00:00:00,0
    2014-01-29 00:00:00,73
    2014-01-30 00:00:00,22
    2014-01-31 00:00:00,43
    2014-02-01 00:00:00,87
    2014-02-02 00:00:00,56
    2014-02-03 00:00:00,45
    2014-02-04 00:00:00,25
    2014-02-05 00:00:00,92
    2014-02-06 00:00:00,83
    2014-02-07 00:00:00,13
    2014-02-08 00:00:00,50
    2014-02-09 00:00:00,48
    2014-02-10 00:00:00,78"""
    
    b = """timepoint,measure
    2014-01-01 00:00:00,78
    2014-01-08 00:00:00,29
    2014-01-15 00:00:00,5
    2014-01-22 00:00:00,73
    2014-01-29 00:00:00,40
    2014-02-05 00:00:00,45
    2014-02-12 00:00:00,48
    2014-02-19 00:00:00,2
    2014-02-26 00:00:00,96
    2014-03-05 00:00:00,82
    2014-03-12 00:00:00,61
    2014-03-19 00:00:00,68
    2014-03-26 00:00:00,8
    2014-04-02 00:00:00,94
    """
    

    look at data

    df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
    df1.head()
    
       timepoint  measure
    0 2014-01-01       78
    1 2014-01-02       29
    2 2014-01-03        5
    3 2014-01-04       73
    4 2014-01-05       40
    
    df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
    df2.head()
    
       timepoint  measure
    0 2014-01-01       78
    1 2014-01-08       29
    2 2014-01-15        5
    3 2014-01-22       73
    4 2014-01-29       40
    

    Func to find the closest date to a given date

    def find_closest_date(timepoint, time_series, add_time_delta_column=True):
        # takes a pd.Timestamp() instance and a pd.Series with dates in it
        # calcs the delta between `timepoint` and each date in `time_series`
        # returns the closest date and optionally the number of days in its time delta
        deltas = np.abs(time_series - timepoint)
        idx_closest_date = np.argmin(deltas)
        res = {"closest_date": time_series.ix[idx_closest_date]}
        idx = ['closest_date']
        if add_time_delta_column:
            res["closest_delta"] = deltas[idx_closest_date]
            idx.append('closest_delta')
        return pd.Series(res, index=idx)
    
    df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                              find_closest_date, args=[df2.timepoint])
    df1.head(10)
    
       timepoint  measure    closest  days_bt_x_and_y
    0 2014-01-01       78 2014-01-01           0 days
    1 2014-01-02       29 2014-01-01           1 days
    2 2014-01-03        5 2014-01-01           2 days
    3 2014-01-04       73 2014-01-01           3 days
    4 2014-01-05       40 2014-01-08           3 days
    5 2014-01-06       45 2014-01-08           2 days
    6 2014-01-07       48 2014-01-08           1 days
    7 2014-01-08        2 2014-01-08           0 days
    8 2014-01-09       96 2014-01-08           1 days
    9 2014-01-10       82 2014-01-08           2 days
    

    Merge the two DataFrames on the new closest date column

    df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])
    
    colorder = [
        'timepoint_x',
        'closest',
        'timepoint_y',
        'days_bt_x_and_y',
        'measure_x',
        'measure_y'
    ]
    
    df3 = df3.ix[:, colorder]
    df3
    
       timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
    0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
    1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
    2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
    3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
    4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
    5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
    6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
    7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
    8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
    9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
    10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
    11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
    12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
    13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
    14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
    15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
    16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
    17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
    18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
    19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
    20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
    21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
    22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
    23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
    24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
    25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
    26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
    27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
    28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
    29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
    30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
    31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
    32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
    33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
    34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
    35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
    36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
    37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
    38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
    39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
    40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48
    
    0 讨论(0)
  • 2020-12-08 12:35

    This is super late, but hopefully its helpful for new answer seekers. I answered a similar question here

    with a somewhat new method in pandas:

    pandas.merge_asof()

    The parameters of interest for you would be direction,tolerance,left_on, and right_on

    Building off @hernamesbarbara answer & data:

    data

    a = """timepoint,measure
    2014-01-01 00:00:00,78
    2014-01-02 00:00:00,29
    2014-01-03 00:00:00,5
    2014-01-04 00:00:00,73
    2014-01-05 00:00:00,40
    2014-01-06 00:00:00,45
    2014-01-07 00:00:00,48
    2014-01-08 00:00:00,2
    2014-01-09 00:00:00,96
    2014-01-10 00:00:00,82
    2014-01-11 00:00:00,61
    2014-01-12 00:00:00,68
    2014-01-13 00:00:00,8
    2014-01-14 00:00:00,94
    2014-01-15 00:00:00,16
    2014-01-16 00:00:00,31
    2014-01-17 00:00:00,10
    2014-01-18 00:00:00,34
    2014-01-19 00:00:00,27
    2014-01-20 00:00:00,58
    2014-01-21 00:00:00,90
    2014-01-22 00:00:00,41
    2014-01-23 00:00:00,97
    2014-01-24 00:00:00,7
    2014-01-25 00:00:00,86
    2014-01-26 00:00:00,62
    2014-01-27 00:00:00,91
    2014-01-28 00:00:00,0
    2014-01-29 00:00:00,73
    2014-01-30 00:00:00,22
    2014-01-31 00:00:00,43
    2014-02-01 00:00:00,87
    2014-02-02 00:00:00,56
    2014-02-03 00:00:00,45
    2014-02-04 00:00:00,25
    2014-02-05 00:00:00,92
    2014-02-06 00:00:00,83
    2014-02-07 00:00:00,13
    2014-02-08 00:00:00,50
    2014-02-09 00:00:00,48
    2014-02-10 00:00:00,78"""
    
    b = """timepoint,measure
    2014-01-01 00:00:00,78
    2014-01-08 00:00:00,29
    2014-01-15 00:00:00,5
    2014-01-22 00:00:00,73
    2014-01-29 00:00:00,40
    2014-02-05 00:00:00,45
    2014-02-12 00:00:00,48
    2014-02-19 00:00:00,2
    2014-02-26 00:00:00,96
    2014-03-05 00:00:00,82
    2014-03-12 00:00:00,61
    2014-03-19 00:00:00,68
    2014-03-26 00:00:00,8
    2014-04-02 00:00:00,94
    """
    

    solution

    import pandas as pd
    from pandas import read_csv
    from io import StringIO
    
    df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
    df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
    
    
    
    df1['timepoint'] = pd.to_datetime(df1['timepoint'])
    df2['timepoint'] = pd.to_datetime(df2['timepoint'])
    
    # converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
    df1.index = df1['timepoint']
    df2.index = df2['timepoint']
    # the magic happens below, check the direction and tolerance arguments
    # if you want you can make a maximum tolerance on which to merge data
    tol = pd.Timedelta('3 day')
    df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
    

    output

    df3.head()
    
        timepoint_x measure_x   timepoint_y measure_y
    timepoint               
    2014-01-01  2014-01-01  78  2014-01-01  78
    2014-01-02  2014-01-02  29  2014-01-01  78
    2014-01-03  2014-01-03  5   2014-01-01  78
    2014-01-04  2014-01-04  73  2014-01-01  78
    2014-01-05  2014-01-05  40  2014-01-08  29
    
    0 讨论(0)
  • 2020-12-08 12:36

    A small addition to the hernamesbarbara's code

    def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
        """takes a pd.Timestamp() instance and a pd.Series with dates in it
        calcs the delta between `timepoint` and each date in `time_series`
        returns the closest date and optionally the number of days in its time delta
    
        Parameters
        ----------
        mode: "abs" (default), "left", "right"
            closest datetime by abs, at left, at right
    
        References
        ----------
        .. [1] http://stackoverflow.com/a/25962323/716469
        """
        deltas = time_series - timepoint
    
        idx_closest_date = None
        if mode == "abs":
            idx_closest_date = np.argmin(abs(deltas))
        elif mode == "left":
            deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
            if len(deltas_):
                idx_closest_date = np.argmax(deltas_)
        elif mode == "right":
            deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
            if len(deltas_):
                idx_closest_date = np.argmin(deltas_)
        else:
            raise Exception("Mode is incorrect")
    
        if idx_closest_date is not None:
            closest_date = time_series.ix[idx_closest_date]
            if add_time_delta_column:
                closest_delta = deltas[idx_closest_date]
        else:
            closest_date = pd.NaT
            if add_time_delta_column:
                closest_delta = pd.Timedelta(pd.NaT)
    
        res = {"closest_date": closest_date}
        idx = ['closest_date']
        if add_time_delta_column:
            res["closest_delta"] = closest_delta
            idx.append('closest_delta')
    
        return pd.Series(res, index=idx)
    
    0 讨论(0)
提交回复
热议问题