numpy: efficient, large dot products

抹茶落季 2020-12-03 19:36

I am trying to perform a large linear-algebra computation to transform a generic covariance matrix KK_l_obs (shape (NL, NL)) into a map of covariance matrices.

2 answers
  • 2020-12-03 20:07

    On a relatively modest machine (4 GB memory), a matmul calculation on the whole 10x10x1000x1000 space works.

    def looping2(n=2):
        # build the (n, n, nl, nl) stack of scaled covariance windows
        ktemp = np.empty((n, n, nl, nl))
        for i, j in np.ndindex(ktemp.shape[:2]):
            I0_ = I0[i, j]
            temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
            temp = temp / a_map[i, j] + k_l_th
            ktemp[i, j, ...] = temp
        # one batched double dot product; @ broadcasts over the leading axes
        K_PC = E @ ktemp @ E.T
        return K_PC

    K = loop()                # the question's original looping version
    k4 = looping2(n=X)
    np.allclose(k4, K.transpose(2, 3, 0, 1))  # True
    

    I haven't tried to vectorize the I0_ mapping. My focus is on generalizing the double dot product.
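    The batched reduction can be sketched self-contained with small placeholder shapes (n=2, nl=8, q=3 are assumptions for illustration, not the question's sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, nl, q = 2, 8, 3                      # placeholder sizes, not the question's
ktemp = rng.standard_normal((n, n, nl, nl))
E = rng.standard_normal((q, nl))

# Batched: @ treats the last two axes as matrices and broadcasts
# over the leading (n, n) axes.
K_PC = E @ ktemp @ E.T                  # shape (n, n, q, q)

# Reference: the explicit per-slice loop.
ref = np.empty((n, n, q, q))
for i, j in np.ndindex(n, n):
    ref[i, j] = E @ ktemp[i, j] @ E.T

print(np.allclose(K_PC, ref))           # True
```

    The broadcasting of @ over all leading axes is what lets the whole stack be reduced in one expression.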

    The equivalent einsum is:

    K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E) 
    

    That produces a ValueError: iterator is too large for n=7.

    But with the latest version

    K_PC = np.einsum('ij,...jk,lk->il...', E, ktemp, E, optimize='optimal')
    

    does work for the full 7x7x10x10 output.
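    For reference, a small self-contained check (again with placeholder shapes, not the question's data) that the optimized einsum agrees with the chained matmul up to an axis transpose:

```python
import numpy as np

rng = np.random.default_rng(1)
n, nl, q = 2, 8, 3                      # placeholder sizes
ktemp = rng.standard_normal((n, n, nl, nl))
E = rng.standard_normal((q, nl))

via_matmul = E @ ktemp @ E.T                        # shape (n, n, q, q)
via_einsum = np.einsum('ij,...jk,lk->il...', E, ktemp, E,
                       optimize='optimal')          # shape (q, q, n, n)

# einsum puts the reduced axes first here, so transpose before comparing.
print(np.allclose(via_matmul, via_einsum.transpose(2, 3, 0, 1)))  # True
```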

    Timings aren't promising: 2.2 s for the original looping, 3.9 s for the big matmul (or einsum). (I get the same 2x speedup with original_mod_app.)

    ============

    time for constructing a (10,10,1000,1000) array (iteratively):

    In [31]: %%timeit 
        ...:     ktemp = np.empty((n,n,nl,nl))
        ...:     for i,j in np.ndindex(ktemp.shape[:2]):
        ...:         I0_ = I0[i, j]
        ...:         temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
        ...:         ktemp[i,j,...] = temp
        ...:     
    1 loop, best of 3: 749 ms per loop
    

    time for reducing that to (10,10,7,7) with @ (longer than the construction)

    In [32]: timeit E @ ktemp @ E.T
    1 loop, best of 3: 1.17 s per loop
    

    time for the same two operations, but with the reduction in the loop

    In [33]: %%timeit 
        ...:     ktemp = np.empty((n,n,q,q))
        ...:     for i,j in np.ndindex(ktemp.shape[:2]):
        ...:         I0_ = I0[i, j]
        ...:         temp = KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl]
        ...:         ktemp[i,j,...] = E @ temp @ E.T
    
    1 loop, best of 3: 858 ms per loop
    

    Performing the dot product within the loop reduces the size of the subarrays that are saved to ktemp, thus making up for the calculation cost. The dot operation on the big array is, by itself, more expensive than your loop. Even if we could 'vectorize' KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl], it wouldn't make up for the cost of handling that big array.

  • 2020-12-03 20:11

    Tweak #1

    One very simple but often overlooked performance tweak in NumPy is replacing division with multiplication. The difference is negligible for scalar-to-scalar divisions or divisions between equal-shaped arrays. But NumPy's implicit broadcasting makes it interesting when dividing arrays of different shapes, or an array by a scalar: in those cases we can get a noticeable boost by multiplying with the reciprocal instead. Thus, for the stated problem, we would pre-compute the reciprocal of a_map and use it for multiplications in place of the divisions.

    So, at the start do :

    r_a_map = 1.0/a_map
    

    Then, within the nested loops, use it as :

    KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * r_a_map[si[0], si[1]]
    
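    A minimal check (with placeholder shapes, not the question's data) that multiplying by the precomputed reciprocal matches the division up to floating-point rounding:

```python
import numpy as np

rng = np.random.default_rng(2)
block = rng.standard_normal((8, 8))      # stand-in for a KK_l_obs window
a_map = rng.uniform(0.5, 2.0, (4, 4))    # placeholder scaling map

r_a_map = 1.0 / a_map                    # computed once, outside the loop
print(np.allclose(block / a_map[0, 1], block * r_a_map[0, 1]))  # True
```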

    Tweak #2

    We could use the distributive property of multiplication over addition there :

    A*(B + C) = A*B + A*C
    

    Thus, k_l_th, which is added in every iteration but stays constant, could be taken outside of the nested loops and accounted for afterwards. Its effective contribution is E.dot(k_l_th).dot(E.T), so we add this to K_PC.
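    A small self-contained sketch (placeholder shapes and data) verifying that the constant term can be factored out of the per-iteration reduction:

```python
import numpy as np

rng = np.random.default_rng(3)
nl, q = 8, 3                             # placeholder sizes
E = rng.standard_normal((q, nl))
k_l_th = rng.standard_normal((nl, nl))   # constant across iterations
A = rng.standard_normal((nl, nl))        # stand-in for one KK_l_obs window
c = 1.7                                  # stand-in for one a_map entry

inside = E @ (A / c + k_l_th) @ E.T                  # constant added per iteration
outside = E @ (A / c) @ E.T + E @ k_l_th @ E.T       # constant factored out
print(np.allclose(inside, outside))                  # True
```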


    Finalizing and benchmarking

    Using tweak #1 and tweak#2, we would end up with a modified approach, like so -

    def original_mod_app():
        r_a_map = 1.0/a_map            # tweak #1: reciprocal computed once
        K_PC = np.empty((q, q, X, Y))
        for si in np.ndindex((X, Y)):
            I0_ = I0[si[0], si[1]]
            K_PC[..., si[0], si[1]] = E.dot(
                KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] *
                r_a_map[si[0], si[1]]).dot(E.T)
        # tweak #2: add the constant k_l_th contribution once, at the end
        return K_PC + E.dot(k_l_th).dot(E.T)[:, :, None, None]
    

    Runtime test with the same sample setup as used in the question -

    In [458]: %timeit original_app()
    1 loops, best of 3: 1.4 s per loop
    
    In [459]: %timeit original_mod_app()
    1 loops, best of 3: 677 ms per loop
    
    In [460]: np.allclose(original_app(), original_mod_app())
    Out[460]: True
    

    So, we are getting a 2x+ speedup there.
