Fast tensor rotation with NumPy

情话喂你 asked on 2020-12-04 17:35

At the heart of an application (written in Python and using NumPy) I need to rotate a 4th order tensor. Actually, I need to rotate a lot of tensors many times, and this is my bottleneck.
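
The question's code is cut off above; the following is a hedged reconstruction of the kind of naive loop implementation the answers below benchmark against (the index order on g follows the Cython answer, which describes itself as a port of "the naive code from the question"):

    import numpy as np

    def rotT(T, g):
        # Naive reference: Tprime[i,j,k,l] = sum over a,b,c,d of
        # g[a,i] * g[b,j] * g[c,k] * g[d,l] * T[a,b,c,d]
        Tprime = np.zeros((3, 3, 3, 3))
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    for l in range(3):
                        for ii in range(3):
                            for jj in range(3):
                                for kk in range(3):
                                    for ll in range(3):
                                        gg = g[ii, i] * g[jj, j] * g[kk, k] * g[ll, l]
                                        Tprime[i, j, k, l] += gg * T[ii, jj, kk, ll]
        return Tprime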

7 Answers
  • 2020-12-04 18:10

    Here is how to do it with a single Python loop:

    def rotT(T, g):
        Tprime = T
        for i in range(4):
            slices = [None] * 4
            slices[i] = slice(None)
            slices *= 2
            Tprime = g[tuple(slices)].T * Tprime  # tuple indexing needed on modern NumPy
        return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)
    

    Admittedly, this is a bit hard to grasp at first glance, but it's quite a bit faster :)
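
    As a quick sanity check (not part of the original answer), the broadcast-and-sum result can be compared against a direct Einstein-summation reference:

    import numpy as np

    T = np.random.rand(3, 3, 3, 3)
    g = np.random.rand(3, 3)

    # rotT as defined above; its result should match the explicit contraction
    # T'[i,j,k,l] = sum_{a,b,c,d} g[a,i] g[b,j] g[c,k] g[d,l] T[a,b,c,d].
    ref = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
    assert np.allclose(rotT(T, g), ref)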

  • 2020-12-04 18:12

    Thanks to hard work by M. Wiebe, the next version of Numpy (which will probably be 1.6) is going to make this even easier:

    >>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
    

    Philipp's approach is at the moment 3x faster, though, so perhaps there is still room for improvement. The speed difference is probably mostly because tensordot can unroll the whole operation into a single matrix product that can be passed on to BLAS, avoiding much of the overhead associated with small arrays; this is not possible for general Einstein summation, as not all operations that can be expressed in this form resolve to a single matrix product.
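
    A side note not in the original answer: newer NumPy releases let einsum search for a pairwise contraction order itself via the optimize argument, which may narrow the gap in cases like this. A hedged sketch:

    import numpy as np

    T = np.random.rand(3, 3, 3, 3)
    g = np.random.rand(3, 3)

    # Ask einsum to break the contraction into pairwise steps instead of one
    # eight-index summation (supported in newer NumPy versions).
    Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T, optimize=True)

    # Inspect the contraction path einsum would choose.
    path, info = np.einsum_path('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T,
                                optimize='optimal')
    print(info)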

  • 2020-12-04 18:16

    Out of curiosity, I've compared a Cython implementation of the naive code from the question with the NumPy code from @Philipp's answer. The Cython code is four times faster on my machine:

    #cython: boundscheck=False, wraparound=False
    import numpy as np
    cimport numpy as np
    
    def rotT(np.ndarray[np.float64_t, ndim=4] T,
             np.ndarray[np.float64_t, ndim=2] g):
        cdef np.ndarray[np.float64_t, ndim=4] Tprime
        cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
        cdef np.float64_t gg
    
        Tprime = np.zeros((3,3,3,3), dtype=T.dtype)
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    for l in range(3):
                        for ii in range(3):
                            for jj in range(3):
                                for kk in range(3):
                                    for ll in range(3):
                                        gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                        Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                             gg*T[ii,jj,kk,ll]
        return Tprime
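
    A hedged usage note (not part of the original answer): assuming the Cython code above is saved as rotT_cy.pyx (a hypothetical file name), one way to compile and call it is via pyximport:

    import numpy as np
    import pyximport

    # Compile .pyx modules on import; NumPy headers are needed for "cimport numpy".
    pyximport.install(setup_args={'include_dirs': np.get_include()})

    from rotT_cy import rotT  # hypothetical module name for the code above

    T = np.random.rand(3, 3, 3, 3)
    g = np.random.rand(3, 3)
    Tprime = rotT(T, g)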
    
  • 2020-12-04 18:19

    Not a new answer, as all the previous ones deal well with the question. More like a comment, but I post it as an answer to have some space for the code.

    While all answers do reproduce the result of the original post, I am pretty sure that the code provided in the original post is wrong. Looking at the formula

    T'_ijkl = Σ_abcd g_ia g_jb g_kc g_ld T_abcd

    which I believe is correct, the indices of g that are varied in the calculation of each entry of T' are a, b, c and d. However, in the code provided in the original post, the indices used to access g in the calculation of gg are swapped with respect to this formula. Hence, I believe the following code actually provides the correct implementation of the formula:

    def rotT(T, g):
        Tprime = np.zeros((3, 3, 3, 3))
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    for l in range(3):
                        for a in range(3):
                            for b in range(3):
                                for c in range(3):
                                    for d in range(3):
                                        Tprime[i, j, k, l] += \
                                            g[i, a] * g[j, b] * \
                                            g[k, c] * g[l, d] * T[a, b, c, d]
        return Tprime
    

    The equivalent, but faster, einsum and tensordot calls then become:

    Tprime = np.tensordot(g, np.tensordot(g, np.tensordot(
        g, np.tensordot(g, T, (1, 3)), (1, 3)), (1, 3)), (1, 3))
    Tprime = np.einsum('ia, jb, kc, ld, abcd->ijkl', g, g, g, g, T)
    

    Additionally, using @jit(nopython=True) from numba on the naive loops function is five times faster than using numpy.tensordot on my machine.
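
    A hedged sketch of that numba variant (assuming numba is installed); it simply applies @jit(nopython=True) to the corrected loop function above:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def rotT_numba(T, g):
        # Same corrected loops as above, compiled by numba.
        Tprime = np.zeros((3, 3, 3, 3))
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    for l in range(3):
                        for a in range(3):
                            for b in range(3):
                                for c in range(3):
                                    for d in range(3):
                                        Tprime[i, j, k, l] += \
                                            g[i, a] * g[j, b] * \
                                            g[k, c] * g[l, d] * T[a, b, c, d]
        return Tprime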

  • 2020-12-04 18:21

    Prospective Approach and solution code

    For memory efficiency, and thereby performance, we could perform the tensor matrix-multiplications in steps.

    To illustrate the steps involved, let's use the simplest of the solutions with np.einsum by @pv. -

    np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
    

    As seen, the first dimension of each of the four g variants is summed away in the tensor multiplication with T.

    Let's do those sum-reductions for tensor matrix multiplications in steps. Let's start off with the first variant of g and T :

    p1 = np.einsum('abcd, ai->bcdi', T, g)
    

    Thus, we end up with a tensor whose dimensions, in string notation, are bcdi. The next steps sum-reduce this tensor against the remaining three g variants used in the original einsum implementation. Hence, the next reduction would be -

    p2 = np.einsum('bcdi, bj->cdij', p1, g)
    

    As seen, we have now summed away the first two dimensions, a and b. We continue for two more steps to get rid of c and d too, and are left with ijkl as the final output, like so -

    p3 = np.einsum('cdij, ck->dijk', p2, g)
    
    p4 = np.einsum('dijk, dl->ijkl', p3, g)
    

    Now, we could use np.tensordot for these sum-reductions, which would be much more efficient.

    Final implementation

    Thus, porting over to np.tensordot, we would have the final implementation like so -

    p1 = np.tensordot(T,g,axes=((0),(0)))
    p2 = np.tensordot(p1,g,axes=((0),(0)))
    p3 = np.tensordot(p2,g,axes=((0),(0)))
    out = np.tensordot(p3,g,axes=((0),(0)))
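
    A compact equivalent of the four calls above (a hedged rewrite, not from the original answer) folds them into a loop and checks the result against the one-shot einsum:

    import numpy as np

    T = np.random.rand(3, 3, 3, 3)
    g = np.random.rand(3, 3)

    # Each pass contracts away the current leading axis of the running result
    # against g's first axis and appends the corresponding output axis at the end.
    out = T
    for _ in range(4):
        out = np.tensordot(out, g, axes=((0), (0)))

    assert np.allclose(out, np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T))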
    

    Runtime test

    Let's benchmark the performance of all the NumPy-based approaches posted in the other answers.

    Approaches as functions -

    def rotT_Philipp(T, g):  # @Philipp's soln
        gg = np.outer(g, g)
        gggg = np.outer(gg, gg).reshape(4 * g.shape)
        axes = ((0, 2, 4, 6), (0, 1, 2, 3))
        return np.tensordot(gggg, T, axes)
    
    def rotT_Sven(T, g):    # @Sven Marnach's soln
        Tprime = T
        for i in range(4):
            slices = [None] * 4
            slices[i] = slice(None)
            slices *= 2
            Tprime = g[tuple(slices)].T * Tprime  # tuple indexing needed on modern NumPy
        return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)    
    
    def rotT_pv(T, g):     # @pv.'s soln
        return np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)
    
    def rotT_Divakar(T,g): # Posted in this post
        p1 = np.tensordot(T,g,axes=((0),(0)))
        p2 = np.tensordot(p1,g,axes=((0),(0)))
        p3 = np.tensordot(p2,g,axes=((0),(0)))
        p4 = np.tensordot(p3,g,axes=((0),(0)))
        return p4
    

    Timings with the original dataset sizes -

    In [304]: # Setup inputs 
         ...: T = np.random.rand(3,3,3,3)
         ...: g = np.random.rand(3,3)
         ...: 
    
    In [305]: %timeit rotT(T, g)
         ...: %timeit rotT_pv(T, g)
         ...: %timeit rotT_Sven(T, g)
         ...: %timeit rotT_Philipp(T, g)
         ...: %timeit rotT_Divakar(T, g)
         ...: 
    100 loops, best of 3: 6.51 ms per loop
    1000 loops, best of 3: 247 µs per loop
    10000 loops, best of 3: 137 µs per loop
    10000 loops, best of 3: 41.6 µs per loop
    10000 loops, best of 3: 28.3 µs per loop
    
    In [306]: 6510.0/28.3 # Speedup with the proposed soln over original code
    Out[306]: 230.03533568904592
    

    As discussed at the start of this post, we are aiming for memory efficiency and, with it, a performance boost. Let's test that as we increase the dataset sizes -

    In [307]: # Setup inputs 
         ...: T = np.random.rand(5,5,5,5)
         ...: g = np.random.rand(5,5)
         ...: 
    
    In [308]: %timeit rotT(T, g)
         ...: %timeit rotT_pv(T, g)
         ...: %timeit rotT_Sven(T, g)
         ...: %timeit rotT_Philipp(T, g)
         ...: %timeit rotT_Divakar(T, g)
         ...: 
    100 loops, best of 3: 6.54 ms per loop
    100 loops, best of 3: 7.17 ms per loop
    100 loops, best of 3: 2.7 ms per loop
    1000 loops, best of 3: 1.47 ms per loop
    10000 loops, best of 3: 39.9 µs per loop
    
  • 2020-12-04 18:30

    To use tensordot, compute the outer product of the g tensors:

    def rotT(T, g):
        gg = np.outer(g, g)
        gggg = np.outer(gg, gg).reshape(4 * g.shape)
        axes = ((0, 2, 4, 6), (0, 1, 2, 3))
        return np.tensordot(gggg, T, axes)
    

    On my system, this is around seven times faster than Sven's solution. If the g tensor doesn't change often, you can also cache the gggg tensor. If you do that and apply some micro-optimizations (inlining the tensordot code, no checks, no generic shapes), you can make it another two times faster:

    def rotT(T, gggg):
        return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                      T.reshape(81, 1)).reshape((3, 3, 3, 3))
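
    A hedged usage sketch (not part of the original answer) of the caching idea: build gggg once and reuse it across many tensors with the second function above.

    import numpy as np

    g = np.random.rand(3, 3)

    # Precompute gggg once, exactly as in the first function.
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)

    # Reuse it for many tensors; rotT here refers to the cached variant above.
    tensors = [np.random.rand(3, 3, 3, 3) for _ in range(1000)]
    rotated = [rotT(T, gggg) for T in tensors]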
    

    Results of timeit on my home laptop (500 iterations):

    Your original code: 19.471129179
    Sven's code: 0.718412876129
    My first code: 0.118047952652
    My second code: 0.0690279006958
    

    The numbers on my work machine are:

    Your original code: 9.77922987938
    Sven's code: 0.137110948563
    My first code: 0.0569641590118
    My second code: 0.0308079719543
    