At the heart of an application (written in Python and using NumPy) I need to rotate a 4th order tensor. Actually, I need to rotate a lot of tensors many times, and this is my bottleneck.
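For concreteness, the naive implementation being referred to is an eight-fold nested loop over the tensor indices. The sketch below is illustrative only (it mirrors the index convention of the Cython port quoted in a later answer, not necessarily the exact code from the original post):

import numpy as np

def rotT_naive(T, g):
    # Loop over every output index (i, j, k, l) and every summed index
    # (ii, jj, kk, ll); roughly 3**8 = 6561 passes through the innermost body.
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii, i] * g[jj, j] * g[kk, k] * g[ll, l]
                                    Tprime[i, j, k, l] += gg * T[ii, jj, kk, ll]
    return Tprime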
Here is how to do it with a single Python loop:
def rotT(T, g):
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        Tprime = g[tuple(slices)].T * Tprime
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)
Admittedly, this is a bit hard to grasp at first glance, but it's quite a bit faster :)
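To see what the indexing trick is doing, it may help to unroll the first pass of the loop by hand; the snippet below is just an illustration (tuple indexing is used so it also runs on current NumPy):

import numpy as np

g = np.random.rand(3, 3)

# First pass (i == 0): both real axes of g survive, everything else
# becomes a size-1 axis, so broadcasting lines up one index pair at a time.
slices = [None] * 4
slices[0] = slice(None)
slices *= 2
view = g[tuple(slices)]
print(view.shape)    # (3, 1, 1, 1, 3, 1, 1, 1)
print(view.T.shape)  # (1, 1, 1, 3, 1, 1, 1, 3)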
Thanks to hard work by M. Wiebe, the next version of Numpy (which will probably be 1.6) is going to make this even easier:
>>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
Philipp's approach is at the moment 3x faster, though perhaps there is some room for improvement. The speed difference is probably mostly due to tensordot being able to unroll the whole operation as a single matrix product that can be passed on to BLAS, thus avoiding much of the overhead associated with small arrays; this is not possible for general Einstein summation, as not all operations that can be expressed in this form resolve to a single matrix product.
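As a quick sanity check (a sketch with small random inputs), the einsum expression gives the same result as the tensordot formulation from Philipp's answer quoted further down:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

# einsum form
Trot_einsum = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

# tensordot form: build the rank-8 coefficient tensor and contract it with T
gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)
Trot_tensordot = np.tensordot(gggg, T, ((0, 2, 4, 6), (0, 1, 2, 3)))

print(np.allclose(Trot_einsum, Trot_tensordot))  # True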
Out of curiosity I've compared a Cython implementation of the naive code from the question with the numpy code from @Philipp's answer. The Cython code is four times faster on my machine:
#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np

def rotT(np.ndarray[np.float64_t, ndim=4] T,
         np.ndarray[np.float64_t, ndim=2] g):
    cdef np.ndarray[np.float64_t, ndim=4] Tprime
    cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
    cdef np.float64_t gg

    Tprime = np.zeros((3,3,3,3), dtype=T.dtype)

    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                    Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                                      gg*T[ii,jj,kk,ll]
    return Tprime
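As a usage note, one way to build this (a sketch; the module name rot_cython.pyx is only illustrative) is a small setup.py driven by cythonize, compiled with python setup.py build_ext --inplace:

# setup.py -- minimal build script for the Cython module above,
# assuming the code was saved as rot_cython.pyx (name is illustrative)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "rot_cython",
    sources=["rot_cython.pyx"],
    include_dirs=[np.get_include()],  # needed for `cimport numpy`
)

setup(ext_modules=cythonize([ext]))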
Not a new answer, as all the previous ones deal well with the question. More like a comment, but I post it as an answer to have some space for the code.
While all answers do reproduce the result of the original post, I am pretty sure that the code provided in the original post is wrong. Looking at the formula T'_ijkl = Σ_abcd g_ia g_jb g_kc g_ld T_abcd, which I believe is correct, the indices of g that are summed over in the calculation of each entry of T' are a, b, c and d. However, in the code provided in the original post, the indices used to access the values of g in the calculation of gg are swapped with regard to the formula. Hence, I believe the following code actually provides the correct implementation of the formula:
def rotT(T, g):
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for a in range(3):
                        for b in range(3):
                            for c in range(3):
                                for d in range(3):
                                    Tprime[i, j, k, l] += \
                                        g[i, a] * g[j, b] * \
                                        g[k, c] * g[l, d] * T[a, b, c, d]
    return Tprime
The equivalent, but faster, calls to einsum and tensordot update to:
Tprime = np.tensordot(g, np.tensordot(g, np.tensordot(
    g, np.tensordot(g, T, (1, 3)), (1, 3)), (1, 3)), (1, 3))
Tprime = np.einsum('ia, jb, kc, ld, abcd->ijkl', g, g, g, g, T)
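A quick check (a sketch with random inputs) that the corrected loop, the tensordot chain and the einsum call all agree:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

T_loop = rotT(T, g)  # the corrected loop version defined above
T_tdot = np.tensordot(g, np.tensordot(g, np.tensordot(
    g, np.tensordot(g, T, (1, 3)), (1, 3)), (1, 3)), (1, 3))
T_es = np.einsum('ia, jb, kc, ld, abcd->ijkl', g, g, g, g, T)

print(np.allclose(T_loop, T_tdot), np.allclose(T_loop, T_es))  # True True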
Additionally, using @jit(nopython=True) from numba on the naive loops function is five times faster than using numpy.tensordot on my machine.
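A minimal sketch of what that looks like, applying @jit(nopython=True) to the corrected loop from above:

import numpy as np
from numba import jit

@jit(nopython=True)
def rotT_numba(T, g):
    # Same eight-fold loop as the corrected version above, compiled by numba
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for a in range(3):
                        for b in range(3):
                            for c in range(3):
                                for d in range(3):
                                    Tprime[i, j, k, l] += (g[i, a] * g[j, b] *
                                                           g[k, c] * g[l, d] *
                                                           T[a, b, c, d])
    return Tprime

Note that the first call includes compilation time; subsequent calls run the compiled code.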
For memory efficiency, and thereby performance, we could perform the tensor matrix-multiplications in steps.
To illustrate the steps involved, let's use the simplest of the solutions with np.einsum by @pv. -
np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
As seen, we are losing the first dimension from g with tensor-multiplication between its four variants and T. Let's do those sum-reductions for tensor matrix multiplications in steps. Let's start off with the first variant of g and T:
p1 = np.einsum('abcd, ai->bcdi', T, g)
Thus, we end up with a tensor whose dimensions are given by the string notation bcdi. The next steps would involve sum-reducing this tensor against the rest of the three g variants as used in the original einsum implementation. Hence, the next reduction would be -
p2 = np.einsum('bcdi, bj->cdij', p1, g)
As seen, we have lost the first two dimensions with the string notations a, b. We continue it for two more steps to get rid of c and d too and would be left with ijkl as the final output, like so -
p3 = np.einsum('cdij, ck->dijk', p2, g)
p4 = np.einsum('dijk, dl->ijkl', p3, g)
Now, we could use np.tensordot for these sum-reductions, which would be much more efficient.
Final implementation
Thus, porting over to np.tensordot, we would have the final implementation like so -
p1 = np.tensordot(T,g,axes=((0),(0)))
p2 = np.tensordot(p1,g,axes=((0),(0)))
p3 = np.tensordot(p2,g,axes=((0),(0)))
out = np.tensordot(p3,g,axes=((0),(0)))
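As a sanity check (a sketch with random inputs), the stepwise chain reproduces the single einsum call, including the final ijkl axis order:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

p1 = np.tensordot(T, g, axes=((0), (0)))
p2 = np.tensordot(p1, g, axes=((0), (0)))
p3 = np.tensordot(p2, g, axes=((0), (0)))
out = np.tensordot(p3, g, axes=((0), (0)))

print(np.allclose(out, np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)))  # True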
Let's test out all the NumPy based approaches posted across the other answers for performance.
Approaches as functions -
def rotT_Philipp(T, g):     # @Philipp's soln
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)

def rotT_Sven(T, g):        # @Sven Marnach's soln
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        Tprime = g[tuple(slices)].T * Tprime
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)

def rotT_pv(T, g):          # @pv.'s soln
    return np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

def rotT_Divakar(T, g):     # Posted in this post
    p1 = np.tensordot(T,g,axes=((0),(0)))
    p2 = np.tensordot(p1,g,axes=((0),(0)))
    p3 = np.tensordot(p2,g,axes=((0),(0)))
    p4 = np.tensordot(p3,g,axes=((0),(0)))
    return p4
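Before timing, a quick correctness check (a sketch; it assumes the functions above are defined in the session) that they all return the same tensor:

# Correctness check before benchmarking
T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

ref = rotT_pv(T, g)
assert np.allclose(rotT_Sven(T, g), ref)
assert np.allclose(rotT_Philipp(T, g), ref)
assert np.allclose(rotT_Divakar(T, g), ref)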
Timings with the original dataset sizes -
In [304]: # Setup inputs
...: T = np.random.rand(3,3,3,3)
...: g = np.random.rand(3,3)
...:
In [305]: %timeit rotT(T, g)
...: %timeit rotT_pv(T, g)
...: %timeit rotT_Sven(T, g)
...: %timeit rotT_Philipp(T, g)
...: %timeit rotT_Divakar(T, g)
...:
100 loops, best of 3: 6.51 ms per loop
1000 loops, best of 3: 247 µs per loop
10000 loops, best of 3: 137 µs per loop
10000 loops, best of 3: 41.6 µs per loop
10000 loops, best of 3: 28.3 µs per loop
In [306]: 6510.0/28.3 # Speedup with the proposed soln over original code
Out[306]: 230.03533568904592
As discussed at the start of this post, we are trying to achieve memory efficiency and hence a performance boost with it. Let's test that out as we increase the dataset sizes -
In [307]: # Setup inputs
...: T = np.random.rand(5,5,5,5)
...: g = np.random.rand(5,5)
...:
In [308]: %timeit rotT(T, g)
...: %timeit rotT_pv(T, g)
...: %timeit rotT_Sven(T, g)
...: %timeit rotT_Philipp(T, g)
...: %timeit rotT_Divakar(T, g)
...:
100 loops, best of 3: 6.54 ms per loop
100 loops, best of 3: 7.17 ms per loop
100 loops, best of 3: 2.7 ms per loop
1000 loops, best of 3: 1.47 ms per loop
10000 loops, best of 3: 39.9 µs per loop
To use tensordot, compute the outer product of the g tensors:
def rotT(T, g):
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)
On my system, this is around seven times faster than Sven's solution. If the g tensor doesn't change often, you can also cache the gggg tensor. If you do this and turn on some micro-optimizations (inlining the tensordot code, no checks, no generic shapes), you can still make it two times faster:
def rotT(T, gggg):
    return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                  T.reshape(81, 1)).reshape((3, 3, 3, 3))
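As a usage sketch of the caching idea: gggg is computed once from g and then reused for every tensor (rotT here is the cached-gggg version just above):

import numpy as np

g = np.random.rand(3, 3)

# Precompute once, as long as g does not change
gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)

# Rotate many tensors with the cached coefficient tensor
tensors = [np.random.rand(3, 3, 3, 3) for _ in range(1000)]
rotated = [rotT(T, gggg) for T in tensors]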
Results of timeit on my home laptop (500 iterations):
Your original code: 19.471129179
Sven's code: 0.718412876129
My first code: 0.118047952652
My second code: 0.0690279006958
The numbers on my work machine are:
Your original code: 9.77922987938
Sven's code: 0.137110948563
My first code: 0.0569641590118
My second code: 0.0308079719543