What are efficient ways to loop over vectors along a specified axis of a numpy ndarray?

广开言路 2021-01-14 11:59

I'm processing data by looping over vectors along an axis (which could be any axis) of a numpy ndarray (which could have any number of dimensions).

I didn't work on array directly becaus

2 answers
  •  逝去的感伤
    2021-01-14 12:49

    The best way to test efficiency is to do time tests on realistic examples. But %timeit (ipython) tests on toy examples are a start.

    Based on experience from answering similar 'if you must iterate' questions, there isn't much difference in times. np.frompyfunc has a modest speed edge - but its pyfunc takes scalars, not arrays or slices. (np.vectorize is a nicer API to this function, and a bit slower).
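
    As a rough illustration of the scalar-only interface, here is a minimal sketch. The function clip_negative and the sample array are made up for this example; only np.frompyfunc and np.vectorize come from the text above.

```python
import numpy as np

def clip_negative(v):
    # Hypothetical example function: it receives one scalar at a time,
    # never an array or a slice.
    return v if v > 0 else 0

x = np.arange(-3, 3).reshape(2, 3)

# frompyfunc(func, nin, nout) returns a ufunc whose output has object dtype.
uf = np.frompyfunc(clip_negative, 1, 1)
out_obj = uf(x)
out_int = out_obj.astype(int)  # convert back to a numeric dtype if needed

# np.vectorize calls the same function element by element,
# and otypes lets you choose the output dtype directly.
vf = np.vectorize(clip_negative, otypes=[int])
out_vec = vf(x)

print(out_obj.dtype)
print(out_vec)
```

    Neither call hands clip_negative a 1d slice, which is why they don't fit the "vector along an axis" case directly.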

    But here you want to pass a 1d slice of an array to your function, while iterating over all the other dimensions. I don't think there's much difference in the alternative iteration methods.

    Actions like swapaxes, transpose and ravel are fast, often just creating a new view with different shape and strides.
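
    One way to check whether such an operation returned a view is np.shares_memory. A small sketch on a toy array (the array shape is just an example):

```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)

swapped = x.swapaxes(0, 2)          # shape (4, 3, 2), same buffer
transposed = x.transpose(1, 0, 2)   # shape (3, 2, 4), same buffer
flat = x.ravel()                    # x is C-contiguous, so this is a view too

# ravel on a non-contiguous array cannot be a view and has to copy.
flat_copy = swapped.ravel()

print(np.shares_memory(x, swapped), swapped.strides)
print(np.shares_memory(x, transposed), transposed.strides)
print(np.shares_memory(x, flat))
print(np.shares_memory(x, flat_copy))
```

    So "often" is the operative word: reordering axes is cheap, but flattening the reordered array afterwards can trigger a copy.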

    np.ndindex uses np.nditer (with the multi_index flag) to iterate over a range of dimensions. nditer is fast when used in C code, but isn't anything special when used in Python code.

    np.apply_along_axis creates a (i,j,:,k) indexing tuple, and steps the variables. It's a nice general approach, but isn't doing anything special to speed things up. itertools.product is another way of generating the indices.
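
    To make the comparison concrete, here is a small sketch that runs the same per-vector function through np.apply_along_axis and through indices generated by itertools.product. The function vector_range is a made-up stand-in for whatever you apply to each vector.

```python
import itertools
import numpy as np

def vector_range(v):
    # Hypothetical per-vector function: takes a 1d slice, returns a scalar.
    return v.max() - v.min()

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Let numpy build the (i, :, k) indices for axis 1.
ptp = np.apply_along_axis(vector_range, 1, x)

# Generate the same (i, k) pairs by hand.
idx_product = list(itertools.product(range(2), range(4)))
idx_ndindex = list(np.ndindex(2, 4))

manual = np.array([vector_range(x[i, :, k]) for i, k in idx_product])
manual = manual.reshape(2, 4)

print(idx_product == idx_ndindex)
print(np.array_equal(ptp, manual))
```

    Both routes call vector_range once per vector, so the cost is dominated by those calls rather than by how the indices are produced.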

    But usually it isn't the iteration mechanism that slows things down, it's the repeated call to your function. You can test the iteration mechanism by using a trivial function, e.g.

    def foo(x):
        return x
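
    A possible timing harness along those lines, using the standard timeit module instead of %timeit so it runs outside ipython. The array shape and the three candidate loops are arbitrary choices for illustration, not a benchmark of record.

```python
import timeit
import numpy as np

def foo(x):
    # Trivial function, so the timings reflect the iteration mechanism only.
    return x

x = np.ones((20, 30, 40))

def with_ndindex():
    return [foo(x[i, :, k]) for i, k in np.ndindex(20, 40)]

def with_apply_along_axis():
    return np.apply_along_axis(foo, 1, x)

def with_reshape():
    # Move the target axis to the end, then flatten the rest into rows.
    rows = np.moveaxis(x, 1, -1).reshape(-1, x.shape[1])
    return [foo(row) for row in rows]

repeats = 20
for fn in (with_ndindex, with_apply_along_axis, with_reshape):
    total = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {total / repeats * 1e3:.3f} ms per call")
```

    Swapping foo for your real function shows how much of the total time the function calls themselves account for.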
    

    ===================

    You don't need to swapaxes to use ndindex; you can use it to iterate on any combination of axes.

    For example, make a 3d array, and sum along the middle dimension:

    In [495]: x=np.arange(2*3*4).reshape(2,3,4)
    
    In [496]: N=np.ndindex(2,4)
    
    In [497]: [x[i,:,k].sum() for i,k in N]
    Out[497]: [12, 15, 18, 21, 48, 51, 54, 57]
    
    In [498]: x.sum(1)
    Out[498]: 
    array([[12, 15, 18, 21],
           [48, 51, 54, 57]])
    

    I don't think it makes a difference in speed; the code's just simpler.
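
    The same idea can be wrapped up for an arbitrary axis and any number of dimensions. This is only a sketch: apply_over_axis is a hypothetical helper name, and it assumes func returns a scalar for each vector.

```python
import numpy as np

def apply_over_axis(func, arr, axis):
    """Hypothetical helper: call func on each 1d vector along `axis`."""
    axis = axis % arr.ndim  # allow negative axes
    other_shape = arr.shape[:axis] + arr.shape[axis + 1:]
    results = []
    for idx in np.ndindex(*other_shape):
        # Insert a full slice at position `axis`, e.g. (i, slice(None), k).
        full_idx = idx[:axis] + (slice(None),) + idx[axis:]
        results.append(func(arr[full_idx]))
    # Assumes func returned scalars, so the results fill other_shape exactly.
    return np.array(results).reshape(other_shape)

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(apply_over_axis(np.sum, x, 1))
print(np.array_equal(apply_over_axis(np.sum, x, 1), x.sum(1)))
```

    No swapaxes call is needed; the slice is simply placed at whichever position the axis occupies.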

    ===================

    Another possible tool is np.ma, masked arrays. With those you mark individual elements as masked (for example because they are nan or 0). Its versions of sum, mean, product and similar reductions skip the masked values, so they don't affect the result.

    The 3d array again:

    In [517]: x=np.arange(2*3*4).reshape(2,3,4)
    

    add in some bad values:

    In [518]: x[1,1,2]=99    
    In [519]: x[0,0,:]=99
    

    those values mess up the normal sum:

    In [520]: x.sum(axis=1)
    Out[520]: 
    array([[111, 113, 115, 117],
           [ 48,  51, 135,  57]])
    

    but if we mask them, they are 'filtered out' of the solution (in this case, they are set temporarily to 0)

    In [521]: xm=np.ma.masked_greater(x,50)
    
    In [522]: xm
    Out[522]: 
    masked_array(data =
     [[[-- -- -- --]
      [4 5 6 7]
      [8 9 10 11]]
    
     [[12 13 14 15]
      [16 17 -- 19]
      [20 21 22 23]]],
                 mask =
     [[[ True  True  True  True]
     ...
      [False False False False]]],
           fill_value = 999999)
    
    In [523]: xm.sum(1)
    Out[523]: 
    masked_array(data =
     [[12 14 16 18]
     [48 51 36 57]],
     ...)
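
    If the bad values are nan rather than large numbers, a similar sketch works with np.ma.masked_invalid. This variant uses a float array (an assumption, since the integer array above cannot hold nan) and marks the same positions as bad.

```python
import numpy as np

x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# Mark the same positions as bad, this time with nan.
x[1, 1, 2] = np.nan
x[0, 0, :] = np.nan

# masked_invalid masks nan and inf entries.
xm = np.ma.masked_invalid(x)

sums = xm.sum(axis=1)    # masked entries are skipped
means = xm.mean(axis=1)  # divides by the count of unmasked entries

print(sums)
print(means)
print(np.nansum(x, axis=1))  # nan-aware function, for comparison
```

    For simple reductions like these, np.nansum and np.nanmean are alternatives that avoid building a mask.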
    
