How to count overlap rows among multiple dataframes?

庸人自扰 2021-01-19 15:44

I have multiple dataframes like the ones below.

df1 = pd.DataFrame({'Col1':["aaa","ffffd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':["ccc","fff","…


        
3 answers
  • 2021-01-19 16:21

    Here is one way using concat and get_dummies:

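    l = [df1, df2, df3]  # list of dataframes to compare
    # Tag each frame's rows with its name ("df1", "df2", ...) and stack them.
    final = pd.concat([i.assign(key=f"df{e+1}") for e, i in enumerate(l)], sort=False)

    # One-hot encode the tag, then collapse duplicate rows, keeping 1 where any frame had the row.
    # dtype=int keeps the 0/1 output on pandas 2.x, where get_dummies returns booleans by default.
    final = (final.assign(**pd.get_dummies(final.pop('key'), dtype=int))
            .groupby(['Col1','Col2','Col3']).max().reset_index())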
    

        Col1 Col2 Col3  df1  df2  df3
    0    aaa  bbb  ccc    1    1    0
    1  ffffd  eee  fff    1    0    0
    2    ggg  hhh  iii    1    0    0
    3    ppp  ttt  qqq    0    0    1
    4    qqq  eee  www    0    1    1
    5    rrr  ttt  yyy    0    0    1
    6    zzz  xxx  yyy    0    1    1
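
If the goal is to count how many dataframes share each row, the indicator columns can be summed across. The following is a minimal sketch of that idea, assuming the sample frames from the setup in the third answer; the column name `n_frames` and the variables `stacked` and `overlap` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Sample frames copied from the setup in the third answer.
df1 = pd.DataFrame({'Col1': ["aaa", "ffffd", "ggg"],
                    'Col2': ["bbb", "eee", "hhh"],
                    'Col3': ["ccc", "fff", "iii"]})
df2 = pd.DataFrame({'Col1': ["aaa", "zzz", "qqq"],
                    'Col2': ["bbb", "xxx", "eee"],
                    'Col3': ["ccc", "yyy", "www"]})
df3 = pd.DataFrame({'Col1': ["rrr", "zzz", "qqq", "ppp"],
                    'Col2': ["ttt", "xxx", "eee", "ttt"],
                    'Col3': ["yyy", "yyy", "www", "qqq"]})

keys = ['Col1', 'Col2', 'Col3']
frames = [df1, df2, df3]

# Tag each frame, stack them, and one-hot encode the tag.
stacked = pd.concat(
    [f.assign(key=f"df{i + 1}") for i, f in enumerate(frames)],
    ignore_index=True,
)
dummies = pd.get_dummies(stacked['key'], dtype=int)
final = (pd.concat([stacked[keys], dummies], axis=1)
         .groupby(keys).max().reset_index())

# n_frames (hypothetical name): number of dataframes that contain the row.
final['n_frames'] = final[['df1', 'df2', 'df3']].sum(axis=1)

# Keep only rows that appear in at least two dataframes.
overlap = final[final['n_frames'] >= 2]
print(overlap)
```

With this sample data the filter should keep the `aaa`, `qqq` and `zzz` rows, each present in two of the three frames.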
    
  • 2021-01-19 16:31

    Using pandas.concat and groupby:

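    dfs = [df1, df2, df3]
    # Add a 'df' column holding each frame's name: 'df1', 'df2', 'df3'.
    dfs = [d.assign(df='df%s' % n) for n, d in enumerate(dfs, start=1)]
    # Count rows per (Col1, Col2, Col3, df), then pivot the frame names into columns.
    new_df = pd.concat(dfs).groupby(['Col1', 'Col2', 'Col3','df']).size().unstack(fill_value=0)
    print(new_df)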
    

    Output:

    df               df1  df2  df3
    Col1  Col2 Col3
    aaa   bbb  ccc     1    1    0
    ffffd eee  fff     1    0    0
    ggg   hhh  iii     1    0    0
    ppp   ttt  qqq     0    0    1
    qqq   eee  www     0    1    1
    rrr   ttt  yyy     0    0    1
    zzz   xxx  yyy     0    1    1
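
One thing to keep in mind with `size()` is that it counts occurrences, so a row repeated inside a single dataframe shows up as 2 rather than 1. The sketch below illustrates this with two hypothetical toy frames, `a` and `b` (not part of the question's data), and uses `clip(upper=1)` as one possible way to turn the counts into presence flags.

```python
import pandas as pd

# Hypothetical toy frames: the row ("x", "y") appears twice in `a`.
a = pd.DataFrame({'Col1': ["x", "x", "p"], 'Col2': ["y", "y", "q"]})
b = pd.DataFrame({'Col1': ["x"], 'Col2': ["y"]})

# Tag each frame with its name, as in the answer above.
tagged = [d.assign(df=name) for name, d in [('a', a), ('b', b)]]

# Occurrence counts per row and per frame.
counts = (pd.concat(tagged)
          .groupby(['Col1', 'Col2', 'df'])
          .size()
          .unstack(fill_value=0))

# Cap the counts at 1 to get a presence flag instead of a count.
presence = counts.clip(upper=1)

print(counts)
print(presence)
```

Whether the count or the flag is wanted depends on how duplicates within one dataframe should be treated.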
    
  • 2021-01-19 16:38

    Setup:

    import pandas as pd

    df1 = pd.DataFrame({'Col1':["aaa","ffffd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':["ccc","fff","iii"]})
    df2 = pd.DataFrame({'Col1':["aaa","zzz","qqq"],'Col2':["bbb","xxx","eee"],'Col3':["ccc", "yyy","www"]})
    df3 = pd.DataFrame({'Col1':["rrr","zzz","qqq","ppp"],'Col2':["ttt","xxx","eee","ttt"],'Col3':["yyy","yyy","www","qqq"]})
    

    Solution:

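    First create an indicator column for each dataframe, then concat, groupby and sum.

    # Flag every row with the name of the frame it came from (this modifies df1, df2 and df3 in place).
    df1['df1'] = df2['df2'] = df3['df3'] = 1
    (
        pd.concat([df1, df2, df3], sort=False)
        .groupby(by=['Col1', 'Col2', 'Col3'])
        .sum().astype(int)  # sum treats the NaN flags left by concat as 0; max would leave NaN, which astype(int) rejects
        .reset_index()
    )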
    
        Col1 Col2 Col3  df1  df2  df3
    0    aaa  bbb  ccc    1    1    0
    1  ffffd  eee  fff    1    0    0
    2    ggg  hhh  iii    1    0    0
    3    ppp  ttt  qqq    0    0    1
    4    qqq  eee  www    0    1    1
    5    rrr  ttt  yyy    0    0    1
    6    zzz  xxx  yyy    0    1    1
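
The same idea can be wrapped in a small helper that leaves the input dataframes untouched. This is only a sketch: the function name `presence_table` and its parameters are hypothetical, and it assumes the frames are passed as a dict mapping a label to a dataframe.

```python
import pandas as pd

def presence_table(frames, keys):
    """frames: dict of label -> DataFrame; keys: list of columns to compare."""
    # assign() returns a copy, so the original frames get no extra column.
    tagged = [f[keys].assign(**{label: 1}) for label, f in frames.items()]
    return (pd.concat(tagged, sort=False)
            .groupby(keys)
            .sum()          # missing flags (NaN after concat) are summed as 0
            .astype(int)
            .reset_index())

df1 = pd.DataFrame({'Col1': ["aaa", "ffffd", "ggg"],
                    'Col2': ["bbb", "eee", "hhh"],
                    'Col3': ["ccc", "fff", "iii"]})
df2 = pd.DataFrame({'Col1': ["aaa", "zzz", "qqq"],
                    'Col2': ["bbb", "xxx", "eee"],
                    'Col3': ["ccc", "yyy", "www"]})
df3 = pd.DataFrame({'Col1': ["rrr", "zzz", "qqq", "ppp"],
                    'Col2': ["ttt", "xxx", "eee", "ttt"],
                    'Col3': ["yyy", "yyy", "www", "qqq"]})

table = presence_table({'df1': df1, 'df2': df2, 'df3': df3},
                       keys=['Col1', 'Col2', 'Col3'])
print(table)
```

Because the flags are added to copies, `df1`, `df2` and `df3` keep their original three columns after the call.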
    