Concatenate specific pairs of columns in a dataframe based on a reference dataframe with a different index

不思量自难忘° 2021-01-14 18:24

My goal is to concatenate columns in a dataframe (Source), based on pairs that are described in a separate dataframe (Reference). The resulting dataframe should replace the c

2 Answers
  • 2021-01-14 18:31

    There are likely better solutions, but this one at least works:

    import pandas as pd
    
    df1 = pd.DataFrame({'FIRST': pd.Series(['Alpha', 'Alpha', 'Charlie'],
                                           index=['H1', 'H2',  'H3']),
                        'SECOND': pd.Series(['Bravo', 'Delta', 'Delta'],
                                            index=['H1', 'H2', 'H3'])})
    
    df2 = pd.DataFrame({'Alpha' : pd.Series(['A', 'C'], index = ['item-000', 'item-111']),
                        'Bravo' : pd.Series(['A', 'C'], index = ['item-000', 'item-111']),
                        'Delta' : pd.Series(['T', 'C'], index = ['item-000', 'item-111']),
                        'Charlie' : pd.Series(['T', 'G'], index = ['item-000', 'item-111'])})
    
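    # For each item row of df2, map every name in df1 to its letter and join each pair.
    # keys=df2.index labels the concatenated columns with the item names.
    pd.concat((df1.T.apply(lambda x: x.map(df2.loc[idx]).str.cat())
               for idx in df2.index),
              axis=1, keys=df2.index).T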
    
    Out[]:
              H1  H2  H3
    item-000  AA  AT  TT
    item-111  CC  CC  GC
    

    Of course this relies on both a for loop in the iterator and an apply, so it will not be very efficient.
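
One way to avoid both the loop and the apply, offered as a sketch rather than a benchmarked replacement: select the df2 columns named in FIRST and in SECOND, then add the two arrays elementwise. The names first, second and result are illustrative, and the sketch assumes every name in df1 exists as a column of df2.

```python
import pandas as pd

df1 = pd.DataFrame({'FIRST': ['Alpha', 'Alpha', 'Charlie'],
                    'SECOND': ['Bravo', 'Delta', 'Delta']},
                   index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame({'Alpha': ['A', 'C'], 'Bravo': ['A', 'C'],
                    'Delta': ['T', 'C'], 'Charlie': ['T', 'G']},
                   index=['item-000', 'item-111'])

# Pick the df2 column for each H-row's FIRST name, then for its SECOND name.
# Each array has one row per item and one column per H-row.
first = df2[df1['FIRST'].tolist()].to_numpy(dtype=object)
second = df2[df1['SECOND'].tolist()].to_numpy(dtype=object)

# Adding object arrays of Python strings concatenates them elementwise.
result = pd.DataFrame(first + second, index=df2.index, columns=df1.index)
print(result)
```

This works for exactly two columns in df1; more columns would need one selection per column.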

  • 2021-01-14 18:32

    Solution
    Using pd.get_dummies and pd.DataFrame.dot

    df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
    
              H1  H2  H3
    item-000  AA  AT  TT
    item-111  CC  CC  GC
    

    Explanation

    I know I want to use a dot product. The rule for matrix multiplication is that an n x k matrix multiplied by a k x m matrix results in an n x m matrix. Looking at the final result, I see ['item-000', 'item-111'] in the index, so that is the n in my n x k matrix. Do I have a starting dataframe with ['item-000', 'item-111'] in either the columns or the index? I do!

    df2
    
               Alpha Bravo Charlie Delta
    item-000     A     A       T     T
    item-111     C     C       G     C
    

    and that implies my k is ['Alpha', 'Bravo', 'Charlie', 'Delta']. Ok, so now I must look for k x m. The only other dataframe I have is df1 and the things that look like ['Alpha', 'Bravo', 'Charlie', 'Delta'] are in the values... not the columns or index. So I must get it there. I decide to stack df1 and use pd.get_dummies.

    pd.get_dummies(df1.stack())
    
               Alpha  Bravo  Charlie  Delta
    H1 FIRST       1      0        0      0
       SECOND      0      1        0      0
    H2 FIRST       1      0        0      0
       SECOND      0      0        0      1
    H3 FIRST       0      0        1      0
       SECOND      0      0        0      1
    

    And now I have ['Alpha', 'Bravo', 'Charlie', 'Delta'] in my columns! That's my k. But I need it in my index. No problem, use transpose.

    pd.get_dummies(df1.stack()).T
    
               H1           H2           H3       
            FIRST SECOND FIRST SECOND FIRST SECOND
    Alpha       1      0     1      0     0      0
    Bravo       0      1     0      0     0      0
    Charlie     0      0     0      0     1      0
    Delta       0      0     0      1     0      1
    

    Right On! Now I'm ready to dot

    df2.dot(pd.get_dummies(df1.stack()).T)
    
                H1           H2           H3       
             FIRST SECOND FIRST SECOND FIRST SECOND
    item-000     A      A     A      T     T      T
    item-111     C      C     C      C     G      C
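
Why a dot product works on letters at all: each output cell is a sum of products, and in Python a string times an integer repeats the string while string plus string concatenates. The helper below is a hypothetical plain-Python illustration of one cell of that product, not what pandas runs internally.

```python
def string_dot(row, col):
    """Sum of elementwise products for one output cell, using str * int and str + str."""
    total = row[0] * col[0]          # 'A' * 1 -> 'A', 'A' * 0 -> ''
    for letter, flag in zip(row[1:], col[1:]):
        total = total + letter * flag
    return total

# item-000 row of df2, in the order Alpha, Bravo, Charlie, Delta
item_000 = ['A', 'A', 'T', 'T']

h2_first = string_dot(item_000, [1, 0, 0, 0])    # H2/FIRST is Alpha
h2_second = string_dot(item_000, [0, 0, 0, 1])   # H2/SECOND is Delta
print(h2_first, h2_second)
```

A dummy column with a single 1 therefore picks out exactly one letter, and every 0 contributes an empty string.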
    

    We are almost there. I concatenate FIRST and SECOND with pd.DataFrame.sum, summing along axis 1 (across the columns) and grouping by the first level of the column index. Summing strings concatenates them.

    df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
    
              H1  H2  H3
    item-000  AA  AT  TT
    item-111  CC  CC  GC
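
A note for newer pandas: the level argument of sum was deprecated in pandas 1.3 and removed in 2.0, and get_dummies returns boolean columns by default from 2.0 on. The sketch below is one possible equivalent under those assumptions: it groups the transposed frame instead and passes dtype=int explicitly. The names dummies, wide and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), index=['H1', 'H2', 'H3'])

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), index=['item-000', 'item-111'])

# One-hot encode the names; integer dummies keep the str * int arithmetic explicit.
dummies = pd.get_dummies(df1.stack(), dtype=int)

# Same dot product as above: one column per (H, FIRST/SECOND) pair.
wide = df2.dot(dummies.T)

# Group the (H, FIRST/SECOND) pairs by H on the transposed frame, sum
# (which concatenates the strings), and transpose back.
result = wide.T.groupby(level=0).sum().T
print(result)
```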
    

    Setup

    import pandas as pd
    
    df1 = pd.DataFrame(dict(
        FIRST=['Alpha', 'Alpha', 'Charlie'],
        SECOND=['Bravo', 'Delta', 'Delta']
    ), ['H1', 'H2', 'H3'])
    
    df2 = pd.DataFrame(dict(
        Alpha=['A', 'C'],
        Bravo=['A', 'C'],
        Delta=['T', 'C'],
        Charlie=['T', 'G']
    ), ['item-000', 'item-111'])
    