pandas.merge: match the nearest time stamp >= the series of timestamps

后端 未结 4 976
再見小時候
再見小時候 2020-11-29 03:16

I have two dataframes, both of which contain an irregularly spaced, millisecond resolution timestamp column. My goal here is to match up the rows so that for each matched ro

相关标签:
4条回答
  • 2020-11-29 03:29

    merge() can't do this kind of join, but you can use searchsorted():

    Create some random timestamps: t1, t2, there are in ascending order:

    import pandas as pd
    import numpy as np
    np.random.seed(0)
    
    base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")
    
    a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
    t1 = base + a
    t1.sort()
    
    b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
    t2 = base + b
    t2.sort()
    

    call searchsorted() to find index in t1 for every value in t2:

    idx = np.searchsorted(t1, t2) - 1
    mask = idx >= 0
    
    df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})
    

    here is the output:

                             t1                         t2
    0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
    1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
    2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
    3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
    4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
    5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
    6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
    7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
    8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000
    

    To view this result by graph:

    import pylab as pl
    pl.figure(figsize=(18, 4))
    pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
    pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
    pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
    pl.margins(0.02)
    

    output:

    enter image description here

    The green lines are t1, blue lines are t2, red lines are selected from t1 for every t2.

    0 讨论(0)
  • 2020-11-29 03:40

    I used a different way than HYRY:

    1. do a regular merge with outer join (how='outer');
    2. sort it by date;
    3. use fillna(method='pad') to take fill just the columns you need and 'pad' if you would like to take the previous filled row;
    4. drop all the rows you don't need from the outer join.

    All this can be written in few lines:

    df=pd.merge(df0, df1, on='Date', how='outer')   
    df=df.sort(['Date'], ascending=[1])
    headertofill=list(df1.columns.values)
    df[headertofill]=df[headertofill].fillna(method='pad')
    df=df[pd.isnull(df[var_from_df0_only])==False] 
    
    0 讨论(0)
  • 2020-11-29 03:42

    Pandas now has the function merge_asof, doing exactly what was described in the accepted answer.

    0 讨论(0)
  • 2020-11-29 03:42

    Here is a simpler and more general method.

    # data and signal are want we want to merge
    keys = ['channel', 'timestamp']  # Could be simply ['timestamp']
    index = data.loc[keys].set_index(keys).index  # Make index from columns to merge on
    padded = signal.reindex(index, method='pad')  # Key step -- reindex with filling
    joined = data.join(padded, on=keys)  # Join to data if needed
    
    0 讨论(0)
提交回复
热议问题