How to count overlapping rows among multiple dataframes?


I have multiple dataframes like the ones below.

df1 = pd.DataFrame({'Col1':["aaa","ffffd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':["ccc","fff","...


                      
3 Answers

眼角桃花, 2021-01-19 16:21

Here is one way using concat and get_dummies:

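import pandas as pd

dfs = [df1, df2, df3]  # list of dataframes to compare
# Tag each frame's rows with its source name ("df1", "df2", ...) and stack them
final = pd.concat([d.assign(key=f"df{i + 1}") for i, d in enumerate(dfs)], sort=False)

# One-hot encode the tag as 0/1 integers, then take the max per unique row
final = (final.assign(**pd.get_dummies(final.pop('key'), dtype=int))
         .groupby(['Col1', 'Col2', 'Col3']).max().reset_index())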




    Col1 Col2 Col3  df1  df2  df3
0    aaa  bbb  ccc    1    1    0
1  ffffd  eee  fff    1    0    0
2    ggg  hhh  iii    1    0    0
3    ppp  ttt  qqq    0    0    1
4    qqq  eee  www    0    1    1
5    rrr  ttt  yyy    0    0    1
6    zzz  xxx  yyy    0    1    1
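
Since the question asks for a count, the indicator columns above can be summed row-wise to see in how many dataframes each row occurs. The following is a minimal sketch under some assumptions: the sample frames are copied from the Setup posted in this thread, and the names `frames`, `keys`, `flags`, `n_frames` and `overlap` are made up for illustration.

```python
import pandas as pd

# Sample frames copied from the Setup in this thread
df1 = pd.DataFrame({'Col1': ["aaa", "ffffd", "ggg"], 'Col2': ["bbb", "eee", "hhh"], 'Col3': ["ccc", "fff", "iii"]})
df2 = pd.DataFrame({'Col1': ["aaa", "zzz", "qqq"], 'Col2': ["bbb", "xxx", "eee"], 'Col3': ["ccc", "yyy", "www"]})
df3 = pd.DataFrame({'Col1': ["rrr", "zzz", "qqq", "ppp"], 'Col2': ["ttt", "xxx", "eee", "ttt"], 'Col3': ["yyy", "yyy", "www", "qqq"]})

frames = {"df1": df1, "df2": df2, "df3": df3}  # hypothetical name -> frame mapping
keys = ['Col1', 'Col2', 'Col3']

# Stack the frames with a source tag; ignore_index gives a unique index for the join
stacked = pd.concat([d.assign(key=name) for name, d in frames.items()], ignore_index=True)

# One 0/1 column per source frame, then collapse duplicates of the same row
indicators = pd.get_dummies(stacked.pop('key'), dtype=int)
flags = stacked.join(indicators).groupby(keys).max().reset_index()

# n_frames (hypothetical column name): number of dataframes containing the row
flags['n_frames'] = flags[list(frames)].sum(axis=1)

# Rows that occur in more than one dataframe
overlap = flags[flags['n_frames'] > 1]
print(overlap)
```

Filtering on `n_frames == len(frames)` instead would keep only rows present in every dataframe.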

                                                                        
                                                        
            
            
              
                
暗喜, 2021-01-19 16:31

Using pandas.concat and groupby:

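dfs = [df1, df2, df3]
# Label each frame's rows with its source name: 'df1', 'df2', ...
dfs = [d.assign(df='df%s' % n) for n, d in enumerate(dfs, start=1)]
# Count rows per (Col1, Col2, Col3, source), then pivot the source labels into columns
new_df = pd.concat(dfs).groupby(['Col1', 'Col2', 'Col3', 'df']).size().unstack(fill_value=0)
print(new_df)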


Output:

df               df1  df2  df3
Col1  Col2 Col3               
aaa   bbb  ccc     1    1    0
ffffd eee  fff     1    0    0
ggg   hhh  iii     1    0    0
ppp   ttt  qqq     0    0    1
qqq   eee  www     0    1    1
rrr   ttt  yyy     0    0    1
zzz   xxx  yyy     0    1    1
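
Note that `size()` counts rows, so a row repeated inside one dataframe would show a value above 1 rather than a plain flag. Here is a minimal sketch of capping the counts at 1, assuming presence flags are what is wanted; the small frames `a` and `b` are made-up examples, not data from the question.

```python
import pandas as pd

# Hypothetical frames: the row (aaa, bbb, ccc) appears twice in `a`
a = pd.DataFrame({'Col1': ["aaa", "aaa"], 'Col2': ["bbb", "bbb"], 'Col3': ["ccc", "ccc"]})
b = pd.DataFrame({'Col1': ["aaa", "zzz"], 'Col2': ["bbb", "xxx"], 'Col3': ["ccc", "yyy"]})

# Same approach as above: label each frame, stack, count, pivot
dfs = [d.assign(df='df%s' % n) for n, d in enumerate([a, b], start=1)]
counts = pd.concat(dfs).groupby(['Col1', 'Col2', 'Col3', 'df']).size().unstack(fill_value=0)

# Cap every count at 1 to turn occurrence counts into 0/1 presence flags
flags = counts.clip(upper=1)

print(counts)
print(flags)
```

Keeping `counts` as-is is the better choice when the number of occurrences per dataframe matters.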

                                                                        
                                                        
            
            
              
                
情话喂你, 2021-01-19 16:38

Setup:

import pandas as pd

df1 = pd.DataFrame({'Col1':["aaa","ffffd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':["ccc","fff","iii"]})
df2 = pd.DataFrame({'Col1':["aaa","zzz","qqq"],'Col2':["bbb","xxx","eee"],'Col3':["ccc","yyy","www"]})
df3 = pd.DataFrame({'Col1':["rrr","zzz","qqq","ppp"],'Col2':["ttt","xxx","eee","ttt"],'Col3':["yyy","yyy","www","qqq"]})


Solution:

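First create an indicator column for each dataframe, then concat, group by the three columns, and take the max of each indicator.

# Flag every row with a 1 in a column named after its own dataframe
df1['df1'] = df2['df2'] = df3['df3'] = 1
(
    pd.concat([df1, df2, df3], sort=False)
    .groupby(by=['Col1', 'Col2', 'Col3'])
    # An indicator is NaN when a row never occurs in that dataframe, so fill with 0 before casting
    .max().fillna(0).astype(int)
    .reset_index()
)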

    Col1 Col2 Col3  df1  df2  df3
0    aaa  bbb  ccc    1    1    0
1  ffffd  eee  fff    1    0    0
2    ggg  hhh  iii    1    0    0
3    ppp  ttt  qqq    0    0    1
4    qqq  eee  www    0    1    1
5    rrr  ttt  yyy    0    0    1
6    zzz  xxx  yyy    0    1    1
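
One side effect of this approach is that assigning the indicator columns modifies `df1`, `df2` and `df3` in place. Below is a minimal sketch of a variant that leaves the inputs untouched. The helper name `indicator_table` and its parameters are hypothetical, and it assumes no frame name collides with an existing column.

```python
import pandas as pd

def indicator_table(frames, keys):
    """Build a 0/1 table showing which frame each unique row occurs in.

    frames: dict mapping a label (used as the indicator column name) to a DataFrame
    keys:   list of column names that identify a row
    """
    # assign() returns a copy, so the caller's frames are not modified
    tagged = [df.assign(**{name: 1}) for name, df in frames.items()]
    return (
        pd.concat(tagged, sort=False)
        .groupby(by=keys)
        .max()        # 1 if the row occurs in that frame, NaN otherwise
        .fillna(0)    # NaN means the row never occurs in that frame
        .astype(int)
        .reset_index()
    )

# Sample frames copied from the Setup above
df1 = pd.DataFrame({'Col1': ["aaa", "ffffd", "ggg"], 'Col2': ["bbb", "eee", "hhh"], 'Col3': ["ccc", "fff", "iii"]})
df2 = pd.DataFrame({'Col1': ["aaa", "zzz", "qqq"], 'Col2': ["bbb", "xxx", "eee"], 'Col3': ["ccc", "yyy", "www"]})
df3 = pd.DataFrame({'Col1': ["rrr", "zzz", "qqq", "ppp"], 'Col2': ["ttt", "xxx", "eee", "ttt"], 'Col3': ["yyy", "yyy", "www", "qqq"]})

result = indicator_table({"df1": df1, "df2": df2, "df3": df3}, ['Col1', 'Col2', 'Col3'])
print(result)
```

Passing the frames as a dict keeps the indicator column names explicit instead of deriving them from list positions.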

                                                                        
                                                        
            
            
              
                