vectorize numpy unique for subarrays

误落风尘 2021-01-05 22:09

I have a numpy array data of shape (N, 20, 20), with N being some very large number. I want to get the number of unique values in each of the 20x20 sub-arrays. with a loop th

1 Answer
  • 2021-01-05 22:42

    First, you can work with data.reshape(N,-1), since you are interested in the values across the last 2 dimensions taken together, not their 20x20 layout.

    An easy way to get the number of unique values for each row is to dump each row into a set and let it do the sorting:

    [len(set(i)) for i in data.reshape(data.shape[0],-1)]
    

    But this is an iteration, though probably a fast one.

    A problem with 'vectorizing' is that the set or list of unique values in each row will differ in length. 'rows with differing length' is a red flag when it comes to 'vectorizing'. You no longer have the 'rectangular' data layout that makes most vectorizing possible.
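    To make the ragged-rows problem concrete, here is a small illustration (my own example, not from the answer): applying np.unique per row yields lists of differing lengths, which cannot be stacked into a single rectangular array.

```python
import numpy as np

# Per-row unique values form ragged lists, so they cannot be
# stacked into one rectangular array for vectorized work.
data = np.array([[1, 2, 2, 3],
                 [4, 4, 4, 4],
                 [5, 6, 7, 8]])

uniques = [np.unique(row) for row in data]
# Row lengths differ: 3, 1, and 4 unique values respectively.
lengths = [len(u) for u in uniques]
print(lengths)
```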

    You could sort each row:

    np.sort(data.reshape(N,-1))
    
    array([[1, 2, 2, 3, 3, 5, 5, 5, 6, 6],
           [1, 1, 1, 2, 2, 2, 3, 3, 5, 7],
           [0, 0, 2, 3, 4, 4, 4, 5, 5, 9],
           [2, 2, 3, 3, 4, 4, 5, 7, 8, 9],
           [0, 2, 2, 2, 2, 5, 5, 5, 7, 9]])
    

    But how do you identify the unique values in each row without iterating? Counting the number of nonzero differences might just do the trick:

    In [530]: data=np.random.randint(10,size=(5,10))
    
    In [531]: [len(set(i)) for i in data.reshape(data.shape[0],-1)]
    Out[531]: [7, 6, 6, 8, 6]
    
    In [532]: sdata=np.sort(data,axis=1)
    In [533]: (np.diff(sdata)>0).sum(axis=1)+1            
    Out[533]: array([7, 6, 6, 8, 6])
    

    I was going to add a warning about floats, but if np.unique is working for your data, my approach should work just as well.
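    As a quick sanity check (my own sketch, using made-up float data): the sort-plus-diff count agrees with a per-row np.unique count on floats, as long as equal floats compare exactly equal.

```python
import numpy as np

# Float data with repeated values (multiples of 0.5).
rng = np.random.default_rng(0)
fdata = rng.integers(0, 5, size=(4, 10)).astype(float) / 2.0

# Count distinct values per row via sorted differences.
sdata = np.sort(fdata, axis=1)
diff_counts = (np.diff(sdata) > 0).sum(axis=1) + 1

# Reference count via np.unique on each row.
unique_counts = np.array([len(np.unique(row)) for row in fdata])
print(np.array_equal(diff_counts, unique_counts))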


    [(np.bincount(i)>0).sum() for i in data]
    

    This is an iterative solution that is clearly faster than my len(set(i)) version, and is competitive with the diff...sort.
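    A cross-check of the three approaches on one array (my own test data). Note that np.bincount only accepts non-negative integers, so the bincount variant does not apply to float or negative data, unlike the other two.

```python
import numpy as np

data = np.random.randint(10, size=(5, 10))

# 1. set-based count per row.
set_counts = [len(set(i)) for i in data]

# 2. sort + diff count per row.
sdata = np.sort(data, axis=1)
diff_counts = ((np.diff(sdata) > 0).sum(axis=1) + 1).tolist()

# 3. bincount-based count per row (non-negative ints only).
bc_counts = [(np.bincount(i) > 0).sum() for i in data]

print(set_counts == diff_counts == bc_counts)
```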

    In [585]: data.shape
    Out[585]: (10000, 400)

    In [586]: timeit [(np.bincount(i)>0).sum() for i in data]
    1 loops, best of 3: 248 ms per loop
    
    In [587]: %%timeit                                       
    sdata=np.sort(data,axis=1)
    (np.diff(sdata)>0).sum(axis=1)+1
       .....: 
    1 loops, best of 3: 280 ms per loop
    

    I just found a faster way to use bincount: pair it with np.count_nonzero.

    In [715]: timeit np.array([np.count_nonzero(np.bincount(i)) for i in data])
    10 loops, best of 3: 59.6 ms per loop
    

    I was surprised at the speed improvement. But then I recalled that count_nonzero is used in other functions (e.g. np.nonzero) to allocate space for their return results. So it makes sense that this function would be coded for maximum speed. (It doesn't help in the diff...sort case, because count_nonzero does not take an axis parameter.)
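    In NumPy 1.12 and later, np.count_nonzero does accept an axis argument (a later addition, not available when the answer above was written), so the diff-of-sorted-rows count can also be expressed with it:

```python
import numpy as np

data = np.random.randint(10, size=(5, 10))

# Fully vectorized count of distinct values per row, using the
# axis parameter that np.count_nonzero gained in NumPy 1.12.
sdata = np.sort(data, axis=1)
counts = np.count_nonzero(np.diff(sdata) > 0, axis=1) + 1

# Cross-check against the set-based count.
expected = np.array([len(set(i)) for i in data])
print(np.array_equal(counts, expected))
```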
