问题
Consider the following dataframes d1
and d1
d1 = pd.DataFrame([
[1, 2, 3],
[2, 3, 4],
[3, 4, 5],
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]
], columns=list('ABC'))
d2 = pd.get_dummies(list('XYZZXY'))
d1
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 1 2 3
4 2 3 4
5 3 4 5
d2
X Y Z
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
5 0 1 0
I need to get a new dataframe with a multi-index columns object that has the product of every combination of columns from d1
and d2
So far I've done this...
from itertools import product
pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
There is nothing wrong with this method. But I'm looking for alternatives to evaluate.
Inspired by Yakym Pirozhenko
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values
pd.DataFrame(
v1[:, lvl0] * v2[:, lvl1],
d1.index,
pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)
However, this is a more clumsy implementation of numpy broadcasting which is better covered by Divakar.
Timing
All answers were good answers and demonstrate different aspects of pandas and numpy. Please consider up-voting them if you found them useful and informative.
%%timeit
m, n = len(d1.columns), len(d2.columns)
lvl0 = np.repeat(np.arange(m), n)
lvl1 = np.tile(np.arange(n), m)
v1, v2 = d1.values, d2.values
pd.DataFrame(
v1[:, lvl0] * v2[:, lvl1],
d1.index,
pd.MultiIndex.from_tuples(list(zip(d1.columns[lvl0], d2.columns[lvl1])))
)
%%timeit
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
pd.DataFrame(vals, columns=cols, index=d1.index)
%timeit d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
%timeit pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
%timeit pd.concat({(x, y): d1[x] * d2[y] for x, y in product(d1, d2)}, axis=1)
1000 loops, best of 3: 663 µs per loop
1000 loops, best of 3: 624 µs per loop
100 loops, best of 3: 3.38 ms per loop
1000 loops, best of 3: 860 µs per loop
100 loops, best of 3: 2.01 ms per loop
回答1:
Here's one approach with NumPy broadcasting -
vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
df_out = pd.DataFrame(vals, columns=cols, index=d1.index)
Sample run -
In [92]: d1
Out[92]:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 1 2 3
4 2 3 4
5 3 4 5
In [93]: d2
Out[93]:
X Y Z
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 1
4 1 0 0
5 0 1 0
In [110]: vals = (d2.values[:,None,:] * d1.values[:,:,None]).reshape(d1.shape[0],-1)
...: cols = pd.MultiIndex.from_product([d1.columns, d2.columns])
...: df_out = pd.DataFrame(vals, columns=cols, index=d1.index)
...:
In [111]: df_out
Out[111]:
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
回答2:
Here's a bit vectorized version. There could be a better way.
In [846]: pd.concat({x : d2.mul(d1[x], axis=0) for x in d1.columns}, axis=1)
Out[846]:
A B C
X Y Z X Y Z X Y Z
0 1 0 0 2 0 0 3 0 0
1 0 2 0 0 3 0 0 4 0
2 0 0 3 0 0 4 0 0 5
3 0 0 1 0 0 2 0 0 3
4 2 0 0 3 0 0 4 0 0
5 0 3 0 0 4 0 0 5 0
回答3:
Here is a one-liner that uses pandas stack and unstack method.
The "trick" is to use stack
, so that the result of each computation within apply
is a time series. Then use unstack
to obtain the Multiindex
form.
d1.apply(lambda x: d2.mul(x, axis=0).stack()).unstack()
Which gives:
A B C
X Y Z X Y Z X Y Z
0 1.0 0.0 0.0 2.0 0.0 0.0 3.0 0.0 0.0
1 0.0 2.0 0.0 0.0 3.0 0.0 0.0 4.0 0.0
2 0.0 0.0 3.0 0.0 0.0 4.0 0.0 0.0 5.0
3 0.0 0.0 1.0 0.0 0.0 2.0 0.0 0.0 3.0
4 2.0 0.0 0.0 3.0 0.0 0.0 4.0 0.0 0.0
5 0.0 3.0 0.0 0.0 4.0 0.0 0.0 5.0 0.0
回答4:
You could get the multi-index first, use it to obtain the shapes and multiply directly.
cols = pd.MultiIndex.from_tuples(
[(c1, c2) for c1 in d1.columns for c2 in d2.columns])
a = d1.loc[:,cols.get_level_values(0)]
b = d2.loc[:,cols.get_level_values(1)]
a.columns = b.columns = cols
res = a * b
来源:https://stackoverflow.com/questions/45285319/pandas-outer-product-of-two-dataframes-with-same-index