My goal is to concatenate columns in a dataframe (Source), based on pairs that are described in a separate dataframe (Reference). The resulting dataframe should replace the c
There will likely be better solutions, but at least this one is working:
import pandas as pd
df1 = pd.DataFrame({'FIRST': pd.Series(['Alpha', 'Alpha', 'Charlie'],
                                       index=['H1', 'H2', 'H3']),
                    'SECOND': pd.Series(['Bravo', 'Delta', 'Delta'],
                                        index=['H1', 'H2', 'H3'])})
df2 = pd.DataFrame({'Alpha': pd.Series(['A', 'C'], index=['item-000', 'item-111']),
                    'Bravo': pd.Series(['A', 'C'], index=['item-000', 'item-111']),
                    'Delta': pd.Series(['T', 'C'], index=['item-000', 'item-111']),
                    'Charlie': pd.Series(['T', 'G'], index=['item-000', 'item-111'])})
pd.concat((df1.T.apply(lambda x: x.map(df2.loc[idx]).str.cat())
           for idx in df2.index),
          axis=1).rename_axis(pd.Series(df2.index), axis=1).T
Out[]:
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
Of course this relies on both a for loop in the iterator and an apply, so it will not be very efficient.
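For comparison, the loop and the apply can be avoided with plain column lookups. This is only a minimal sketch and not part of the answer above; it assumes every name in df1 also appears as a column of df2, and the names first, second, and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'FIRST': ['Alpha', 'Alpha', 'Charlie'],
                    'SECOND': ['Bravo', 'Delta', 'Delta']},
                   index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame({'Alpha': ['A', 'C'], 'Bravo': ['A', 'C'],
                    'Delta': ['T', 'C'], 'Charlie': ['T', 'G']},
                   index=['item-000', 'item-111'])

# Pull the df2 column named by each entry of FIRST and of SECOND.
# Duplicate labels are allowed, so each lookup yields one column per H row.
first = df2[df1['FIRST'].tolist()].to_numpy(dtype=object)
second = df2[df1['SECOND'].tolist()].to_numpy(dtype=object)

# Adding object arrays of str concatenates element by element.
result = pd.DataFrame(first + second, index=df2.index, columns=df1.index)
print(result)
```

The lookup is done once per column of df1 rather than once per row of df2, so the work no longer grows with a Python-level loop over the items.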
Solution
Using pd.get_dummies and pd.DataFrame.dot
df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
Explanation
I know I want to use a dot product. The rule of matrix multiplication is that an n x k matrix multiplied by a k x m matrix results in an n x m matrix. Looking at the final result, I see ['item-000', 'item-111'] in the index; that is my n in my n x k matrix. I look at my preliminary dataframes: do I have one with ['item-000', 'item-111'] in either the columns or index? I do!
df2
         Alpha Bravo Charlie Delta
item-000     A     A       T     T
item-111     C     C       G     C
and that implies my k is ['Alpha', 'Bravo', 'Charlie', 'Delta']. Ok, so now I must look for k x m. The only other dataframe I have is df1, and the things that look like ['Alpha', 'Bravo', 'Charlie', 'Delta'] are in the values... not the columns or index. So I must get it there. I decide to stack df1 and use pd.get_dummies.
pd.get_dummies(df1.stack())
           Alpha  Bravo  Charlie  Delta
H1 FIRST       1      0        0      0
   SECOND      0      1        0      0
H2 FIRST       1      0        0      0
   SECOND      0      0        0      1
H3 FIRST       0      0        1      0
   SECOND      0      0        0      1
And now I have ['Alpha', 'Bravo', 'Charlie', 'Delta'] in my columns! That's my k. But I need it in my index. No problem, use transpose.
pd.get_dummies(df1.stack()).T
           H1           H2           H3
        FIRST SECOND FIRST SECOND FIRST SECOND
Alpha       1      0     1      0     0      0
Bravo       0      1     0      0     0      0
Charlie     0      0     0      0     1      0
Delta       0      0     0      1     0      1
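The n x k and k x m bookkeeping can be checked directly on the frames. The sketch below is an illustration rather than part of the original answer: dummies_t is a name introduced here, and dtype=int is passed on the assumption that integer indicators are wanted on every pandas version (newer versions default to booleans).

```python
import pandas as pd

df1 = pd.DataFrame({'FIRST': ['Alpha', 'Alpha', 'Charlie'],
                    'SECOND': ['Bravo', 'Delta', 'Delta']},
                   index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame({'Alpha': ['A', 'C'], 'Bravo': ['A', 'C'],
                    'Delta': ['T', 'C'], 'Charlie': ['T', 'G']},
                   index=['item-000', 'item-111'])

# One-hot encode the stacked names, then transpose so the names are the index.
dummies_t = pd.get_dummies(df1.stack(), dtype=int).T

n, k_left = df2.shape          # items x names
k_right, m = dummies_t.shape   # names x (H row, FIRST/SECOND) pairs

# DataFrame.dot matches df2's columns against the other frame's index by
# label, so the two sets of names must agree even if their order differs.
labels_match = set(df2.columns) == set(dummies_t.index)
print((n, k_left), (k_right, m), labels_match)
```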
Right On! Now I'm ready to dot
df2.dot(pd.get_dummies(df1.stack()).T)
            H1           H2           H3
         FIRST SECOND FIRST SECOND FIRST SECOND
item-000     A      A     A      T     T      T
item-111     C      C     C      C     G      C
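It may not be obvious why a dot product yields letters. A dot product is a sum of pairwise products, and for Python strings multiplying by an integer repeats the string while adding concatenates. The plain-Python sketch below reproduces a single cell, the one for item-000 under (H2, SECOND); the names letters, one_hot, terms, and cell are illustrative only.

```python
from functools import reduce
from operator import add

# Row 'item-000' of df2 in the order Alpha, Bravo, Charlie, Delta.
letters = ['A', 'A', 'T', 'T']
# One-hot column for (H2, SECOND), whose value in df1 is 'Delta'.
one_hot = [0, 0, 0, 1]

# str * 0 is the empty string and str * 1 is the string itself.
terms = [letter * flag for letter, flag in zip(letters, one_hot)]

# Summing the terms concatenates them, leaving only the selected letter.
cell = reduce(add, terms)
print(terms, cell)
```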
We are almost there. I concatenate FIRST and SECOND by using pd.DataFrame.sum, where I specify that I want to sum along axis 1 (across the columns of each row), grouped by the first level of the columns object.
df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
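One caveat for newer installations: to my knowledge the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in pandas 2.0, so the call above may raise a TypeError there. A hedged sketch of an equivalent that groups with groupby instead is shown below; the names dotted and result are illustrative, and dtype=int is an assumption made so the indicators are integers regardless of version.

```python
import pandas as pd

df1 = pd.DataFrame({'FIRST': ['Alpha', 'Alpha', 'Charlie'],
                    'SECOND': ['Bravo', 'Delta', 'Delta']},
                   index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame({'Alpha': ['A', 'C'], 'Bravo': ['A', 'C'],
                    'Delta': ['T', 'C'], 'Charlie': ['T', 'G']},
                   index=['item-000', 'item-111'])

# Same dot product as in the answer: one letter per (H row, FIRST/SECOND).
dotted = df2.dot(pd.get_dummies(df1.stack(), dtype=int).T)

# Transpose so the (H, FIRST/SECOND) pairs are rows, group on the first
# index level, sum the strings within each group (which concatenates
# FIRST then SECOND), and transpose back.
result = dotted.T.groupby(level=0).sum().T
print(result)
```

Grouping on the transposed frame avoids the axis=1 form of groupby, which recent pandas versions also discourage.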
Setup
df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])
df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])