Question
I want to do the element-wise outer product of three (or four) large 2D arrays in Python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different numbers of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).
np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)
np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
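To make the operation concrete, here is a tiny toy example of what I mean (made-up shapes, not my real data):
import numpy as np

# toy shapes: n=2 rows, i=2, j=3, k=2 columns
a = np.array([[1., 2.], [3., 4.]], dtype=np.float32)
b = np.ones((2, 3), dtype=np.float32)
c = np.ones((2, 2), dtype=np.float32)

# row-wise outer product, flattened per row -> shape (n, i*j*k) = (2, 12)
intermediate = (a[:, :, None, None] * b[:, None, :, None] * c[:, None, None, :]).reshape(2, -1)

# summing each column gives the final 1D result of length i*j*k = 12
result = intermediate.sum(axis=0)
print(result.shape)  # (12,)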
Thanks to ruankesi and divakar, I got a piece of code that works:
# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])
# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])
# Get the result: sum columns and get a 1D array of length 10*28*66 = 18480
result = np.sum(abc_fills, axis=0)
Problem 1: Performance
This takes about 3 seconds, but I have to repeat this operation many times and some of the matrices are even larger (in number of rows). It is acceptable but making it faster would be nice.
Problem 2: My matrices are sparse
Indeed, for instance, "a" contains 70% zeros. I tried to play with scipy's csc_matrix, but really could not get a working version. (To get the element-wise outer product here, I go via a conversion to a 3D array, which scipy sparse matrices do not support.)
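For what it's worth, with only two matrices the column sums reduce to a plain matrix product, which sparse matrices do support; it is the three-way case, where I need the 3D intermediate, that I could not translate. A minimal sketch of the two-matrix case (just to illustrate, variable names are mine):
from scipy.sparse import csr_matrix

a_sp = csr_matrix(a)   # (n, i), ~70% zeros
b_sp = csr_matrix(b)   # (n, j)

# sum_n a[n, p] * b[n, q] == (a.T @ b)[p, q], so the column sums of the
# two-matrix element-wise outer product are just a sparse matrix product
two_way_result = (a_sp.T @ b_sp).toarray().ravel()   # shape (i*j,)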
Problem 3: memory usage
If I try to also work with a 4th matrix, I run into memory issues.
I imagine that converting this code to use sparse matrices would save a lot of memory and make the calculation faster by ignoring the numerous zero values.
Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I've been stuck on this part of the code for weeks; I'm going nuts!
Thank you!
Edit after Divakar's answer
Approach #1:
Very nice one-liner, but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize = True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop
Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop
About numexpr: I try to avoid external module requirements as much as possible, and I don't plan to use multiple cores/threads. The task is "embarrassingly" parallel, with hundreds of thousands of objects to analyze, so I'll just spread the list across the available CPUs in production. I will give numexpr a try for memory optimization.
In a brief test of numexpr restricted to 1 thread, performing a single multiplication, I get a runtime of 40 ms without numexpr and 52 ms with numexpr.
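(For reference, a minimal sketch of how that single-thread restriction can be set, assuming numexpr's set_num_threads API:)
import numexpr as ne

ne.set_num_threads(1)   # limit numexpr to a single worker thread before timing
first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})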
Thanks again!!
Answer 1:
Approach #1
We can use np.einsum to do sum-reductions in one go -
result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
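For example (a one-line variant of the above, not separately benchmarked here):
result = np.einsum('ij,ik,il->jkl', a, b, c, optimize=True).ravel()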
Approach #2
We can use broadcasting to do the first step, as also mentioned in the posted code, and then leverage tensor matrix multiplication with np.tensordot -
def broadcast_dot(a,b,c):
    first_multi = a[...,None] * b[:,None]
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
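As a quick sanity check on random data (smaller row count just to keep it fast), the tensordot route should agree with the einsum result:
n, i, j, k = 10_000, 10, 28, 66
rng = np.random.default_rng(0)
a = rng.random((n, i))
b = rng.random((n, j))
c = rng.random((n, k))
ref = np.einsum('ij,ik,il->jkl', a, b, c).ravel()
assert np.allclose(broadcast_dot(a, b, c), ref)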
We can also use the numexpr module, which supports multi-core processing and achieves better memory efficiency, to get first_multi. This gives us a modified solution, like so -
import numexpr as ne

def numexpr_broadcast_dot(a,b,c):
    first_multi = ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
Timings on random float data with given dataset sizes -
In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to give a sense of the improvement with numexpr -
In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This should be substantial when extending this solution to a higher number of inputs.
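One possible way to extend Approach #2 to a fourth array d with the same number of rows (a sketch along the same lines, not benchmarked here):
def broadcast_dot4(a, b, c, d):
    # pair up the inputs and flatten each row-wise outer product to 2D
    ab = (a[..., None] * b[:, None]).reshape(a.shape[0], -1)   # (n, i*j)
    cd = (c[..., None] * d[:, None]).reshape(c.shape[0], -1)   # (n, k*m)
    # sum-reduce over the row axis with tensordot, then flatten to 1D
    return np.tensordot(ab, cd, axes=(0, 0)).ravel()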
Source: https://stackoverflow.com/questions/56493280/numpy-elementwise-outer-product-with-sparse-matrices