It is my understanding that the outer product of a vector with its transpose is symmetric in value. Does NumPy take this into account, and only do the multiplications for one triangle of the result?
Exploring some alternatives:
In [162]: x=np.arange(100)
In [163]: np.outer(x,x)
Out[163]:
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
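As a quick check that the result really is symmetric (this only confirms the premise; it says nothing about how NumPy computed it):

M = np.outer(x, x)
np.array_equal(M, M.T)    # True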
In [164]: x1=x[:,None]
In [165]: x1*x1.T
Out[165]:
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
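The broadcasting version works because x1 has shape (100,1) and x1.T has shape (1,100); elementwise multiplication broadcasts both out to (100,100). A quick sanity check that it matches outer:

x1.shape, x1.T.shape                         # ((100, 1), (1, 100))
np.array_equal(x1 * x1.T, np.outer(x, x))    # True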
In [166]: np.dot(x1,x1.T)
Out[166]:
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
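Note that dot needs the 2-D column/row views here; applied to the 1-D x it computes the inner product instead:

np.dot(x, x)              # 328350, the scalar inner product
np.dot(x1, x1.T).shape    # (100, 100), the rank-1 matrix product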
Comparing their times:
In [167]: timeit np.outer(x,x)
40.8 µs ± 63.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [168]: timeit x1*x1.T
36.3 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [169]: timeit np.dot(x1,x1.T)
60.7 µs ± 6.86 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Is dot using a transpose shortcut? I don't think so, or if it does, it doesn't help in this case. I'm a little surprised that dot is slower. Precomputing the transpose doesn't change the time:
In [170]: x2=x1.T
In [171]: timeit np.dot(x1,x2)
61.1 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Another method:
In [172]: timeit np.einsum('i,j',x,x)
28.3 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
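With no output subscripts, 'i,j' defaults to 'i,j->ij', which is exactly the outer product; writing it out explicitly gives the same result:

np.array_equal(np.einsum('i,j->ij', x, x), np.outer(x, x))    # True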
einsum with x1 and x2 has the same times.
Interesting that matmul does as well as einsum in this case (maybe einsum is delegating to matmul?)
In [178]: timeit x1@x2
27.3 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [179]: timeit x1@x1.T
27.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
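Whatever the timing differences, all of these methods produce the same array; a quick consistency check:

M = np.outer(x, x)
for alt in (x1 * x1.T, np.dot(x1, x1.T), np.einsum('i,j', x, x), x1 @ x1.T):
    assert np.array_equal(M, alt)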
Numpy efficient matrix self-multiplication (gram matrix) demonstrates how dot can save time by being clever (for a 1000x1000 array).
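A minimal sketch of that kind of experiment: compare A@A.T, where the transpose relationship is detectable, against A@B, where B holds the same values in a separate buffer. If dot dispatches to a symmetric BLAS routine (syrk) in the first case, it should be noticeably faster; the actual ratio depends on the BLAS build:

import numpy as np
from timeit import timeit

A = np.random.rand(1000, 1000)
B = A.T.copy()    # same values as A.T, but a fresh buffer

# If dot detects that the second operand is a transposed view of the
# first, it can skip roughly half the multiplications.
print(timeit(lambda: A @ A.T, number=10))
print(timeit(lambda: A @ B, number=10))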
As discussed in the links, dot can detect when one argument is the transpose of the other (probably by checking the data buffer pointer, shape, and strides), and can use a BLAS function optimized for symmetric calculations. But I don't see evidence of outer doing that, and it's unlikely that broadcasted multiplication would take such a step.
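For illustration, here's a hypothetical version of such a detection test in pure NumPy (the name is_transpose_pair and the exact conditions are my sketch, not NumPy's actual dispatch logic). A transposed view shares its data pointer with the original and has reversed shape and strides:

import numpy as np

def is_transpose_pair(a, b):
    # Hypothetical sketch of the kind of test dot could apply:
    # same data buffer, with shape and strides reversed.
    ptr_a = a.__array_interface__['data'][0]
    ptr_b = b.__array_interface__['data'][0]
    return ptr_a == ptr_b and a.shape == b.shape[::-1] and a.strides == b.strides[::-1]

x1 = np.arange(100)[:, None]
print(is_transpose_pair(x1, x1.T))          # True: a view of the same buffer
print(is_transpose_pair(x1, x1.T.copy()))   # False: a separate buffer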