Python Pandas: How to merge based on an “OR” condition?

后端未结

关注

 2  1827

甜味超标 2021-02-13 03:55

Let\'s say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[Shi


      
      
        
          2条回答        

        
                    
            
            
                         
                
              
              
                
                   遥遥无期
                                             
                
                
                (楼主)
            
              
              
                2021-02-13 04:14
              

            
            
                        
Use merge() and concat().  Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})

df1         df2
   A  B        A  B
0  1  7     0  1  4
1  2  8     1  5  1
2  3  9     2  6  8
3  4  5     3  4  5


With these data frames we should see:  


df1.loc[0] matches A on df2.loc[0]  
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]


We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

pd.concat([df1.merge(df2, on='A', suffixes=suff_A), 
           df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN


Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5).  We need to remove one of those sets.

dupes = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dupes]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它2个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复