Match rows in one Pandas dataframe to another based on three columns

陌清茗 2021-02-07 20:38

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

The dfA looks something like:

      X     Y    ONSET_TIME          


        
2 Answers
  • 2021-02-07 21:01

    Use merge(), which works like a JOIN in SQL. Merging on X and Y gives you the first part (matching the positions).

    d1 = '''      X     Y    ONSET_TIME    COLOUR 
       104    78          1083         6    
       172    78          1083        16
       240    78          1083        15 
       308    78          1083         8
       376    78          1083         8
       444    78          1083        14
       512    78          1083        14
       308    78          3000        14
       308    78          2000        14''' 
    
    
    d2 = '''    TIME     X     Y
          7   512   350 
       1722   512   214 
       1906   376   214 
       2095   376   146 
       2234   308    78 
       2406   172   146'''
    
    import pandas as pd
    from io import StringIO
    
    # read the whitespace-separated text tables into DataFrames
    dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
    #print(dfA)
    
    dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
    #print(dfB)
    
    # inner join: keep only rows whose (X, Y) pair occurs in both frames
    df1 = pd.merge(dfA, dfB, on=['X','Y'])
    print(df1)
    

    result:

         X   Y  ONSET_TIME  COLOUR  TIME
    0  308  78        1083       8  2234
    1  308  78        3000      14  2234
    2  308  78        2000      14  2234
    

    Then filter the merged frame to keep only the rows where ONSET_TIME is earlier than TIME.

    df2 = df1[df1['ONSET_TIME'] < df1['TIME']]
    print(df2)
    

    result:

         X   Y  ONSET_TIME  COLOUR  TIME
    0  308  78        1083       8  2234
    2  308  78        2000      14  2234
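
The two steps above can also be chained into a single expression. This is only a minimal sketch, assuming the same column names as in this answer (X, Y, ONSET_TIME, TIME). The helper name `match_before` and the small frames are made up for illustration and are not part of pandas or of the question's data.

```python
import pandas as pd


def match_before(df_a, df_b):
    """Inner-join on X and Y, then keep rows whose ONSET_TIME precedes TIME."""
    merged = df_a.merge(df_b, on=["X", "Y"])
    return merged.query("ONSET_TIME < TIME").reset_index(drop=True)


# Small made-up frames with the same column layout as dfA and dfB.
df_a = pd.DataFrame({"X": [308, 308, 104], "Y": [78, 78, 78],
                     "ONSET_TIME": [1083, 3000, 1083], "COLOUR": [8, 14, 6]})
df_b = pd.DataFrame({"TIME": [2234, 7], "X": [308, 512], "Y": [78, 350]})

matched = match_before(df_a, df_b)
print(matched)
```

Note that an inner join produces one row for every combination of matching (X, Y) rows, so a position that occurs several times in both frames is repeated accordingly before the time filter is applied.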
    
  • 2021-02-07 21:02

    There is probably an even more efficient way to do this, but here is a method without those slow for loops:

    import pandas as pd
    
    dfB = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3], 'Time': [10, 20, 30]})
    dfA = pd.DataFrame({'X': [1, 1, 2, 2, 2, 3], 'Y': [1, 1, 2, 2, 2, 3],
                        'ONSET_TIME': [5, 7, 9, 16, 22, 28],
                        'COLOR': ['Red', 'Blue', 'Blue', 'red', 'Green', 'Orange']})
    
    # create one single table
    mergeDf = pd.merge(dfA, dfB, on=['X', 'Y'])
    # remove rows where Time is not later than ONSET_TIME
    filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
    # per (X, Y), keep the whole row with the largest onset time (closest to Time);
    # a plain .max() would take each column's maximum independently and mix rows
    groupedDf = filteredDf.loc[filteredDf.groupby(['X', 'Y'])['ONSET_TIME'].idxmax()]
    
    print(filteredDf)
    
       X  Y  ONSET_TIME   COLOR  Time
    0  1  1           5     Red    10
    1  1  1           7    Blue    10
    2  2  2           9    Blue    20
    3  2  2          16     red    20
    5  3  3          28  Orange    30
    
    
    print(groupedDf)
    
       X  Y  ONSET_TIME   COLOR  Time
    1  1  1           7    Blue    10
    3  2  2          16     red    20
    5  3  3          28  Orange    30
    

    The basic idea is to merge the two tables so that both times end up in one table. Then I drop the rows whose ONSET_TIME is not earlier than Time and, for each (X, Y) pair, keep the record with the largest ONSET_TIME (the one closest to the time in your dfB). Let me know if you have questions about this.
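
If only the closest earlier onset per dfB row is needed, `pandas.merge_asof` may be one candidate for the "more efficient way" mentioned above, since it does not build every (X, Y) combination first. This is a different technique from the merge, filter and group steps in this answer. A minimal sketch, assuming the same toy frames and column names as above:

```python
import pandas as pd

dfB = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3], 'Time': [10, 20, 30]})
dfA = pd.DataFrame({'X': [1, 1, 2, 2, 2, 3], 'Y': [1, 1, 2, 2, 2, 3],
                    'ONSET_TIME': [5, 7, 9, 16, 22, 28],
                    'COLOR': ['Red', 'Blue', 'Blue', 'red', 'Green', 'Orange']})

# merge_asof requires both frames to be sorted by their time key.
closest = pd.merge_asof(
    dfB.sort_values('Time'),
    dfA.sort_values('ONSET_TIME'),
    left_on='Time',
    right_on='ONSET_TIME',
    by=['X', 'Y'],               # exact match on X and Y
    direction='backward',        # search for onsets at or before Time ...
    allow_exact_matches=False,   # ... and exclude onsets equal to Time
)
print(closest)
```

Unlike the inner merge, this keeps every row of dfB: a row with no earlier onset at the same (X, Y) gets missing values in the dfA columns, so it may need a `dropna` afterwards depending on what you want.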
