Cannot understand numpy argpartition output

孤独总比滥情好 2020-12-05 14:05

I am trying to use argpartition from numpy, but something seems to be going wrong and I cannot figure it out. Here is what's happening:

These are fir

3 answers
  • 2020-12-05 14:29

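    To get the first elements in sorted order, pass the kth param as a list of indices instead of a scalar. A scalar kth only guarantees that the element at position kth is in its sorted place, with smaller or equal elements before it and larger or equal elements after it, in no particular order. Thus, to keep the first 5 elements sorted, instead of np.argpartition(a,5)[:5], simply do -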

    np.argpartition(a,range(5))[:5]
    

    Here's a sample run to make things clear -

    In [84]: a = np.random.rand(10)
    
    In [85]: a
    Out[85]: 
    array([ 0.85017222,  0.19406266,  0.7879974 ,  0.40444978,  0.46057793,
            0.51428578,  0.03419694,  0.47708   ,  0.73924536,  0.14437159])
    
    In [86]: a[np.argpartition(a,5)[:5]]
    Out[86]: array([ 0.19406266,  0.14437159,  0.03419694,  0.40444978,  0.46057793])
    
    In [87]: a[np.argpartition(a,range(5))[:5]]
    Out[87]: array([ 0.03419694,  0.14437159,  0.19406266,  0.40444978,  0.46057793])
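
The difference between the two calls above can be checked directly. The following is a minimal sketch, assuming random data with an arbitrarily chosen seed; the variable names are illustrative and not part of the original answer. It tests the partition guarantee for a scalar kth and confirms that both calls select the same 5 smallest values, with only the range(5) call promising them in order.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
a = rng.random(10)
k = 5

idx_scalar = np.argpartition(a, k)         # only position k is guaranteed to be in its sorted place
idx_ranged = np.argpartition(a, range(k))  # positions 0..k-1 are each in their sorted place

pivot = a[idx_scalar[k]]
# Partition guarantee: everything before position k is <= pivot,
# everything after it is >= pivot, with no order promised inside either side.
left_ok = bool(np.all(a[idx_scalar[:k]] <= pivot))
right_ok = bool(np.all(a[idx_scalar[k + 1:]] >= pivot))

# Both calls pick the same k smallest values; sorting the scalar result
# reproduces what the ranged call returns directly.
same_values = np.array_equal(np.sort(a[idx_scalar[:k]]), a[idx_ranged[:k]])

print(left_ok, right_ok, same_values)
```

If the first 5 values from the scalar call happen to come out sorted on some input, that is a coincidence of the data and not something the function promises.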
    

    Please note that argpartition only pays off performance-wise if we want sorted indices for a small subset of the elements, say k elements, where k is a small fraction of the total number of elements.

    Let's use a bigger dataset and ask for sorted indices of all elements to make this point clear -

    In [51]: a = np.random.rand(10000)*100
    
    In [52]: %timeit np.argpartition(a,range(a.size-1))[:5]
    10 loops, best of 3: 105 ms per loop
    
    In [53]: %timeit a.argsort()
    1000 loops, best of 3: 893 µs per loop
    

    Thus, to sort all elements, np.argpartition isn't the way to go.

    Now, let's say I want sorted indices for only the 5 smallest elements of that big dataset, and I also want to keep them in order -

    In [68]: a = np.random.rand(10000)*100
    
    In [69]: np.argpartition(a,range(5))[:5]
    Out[69]: array([1647,  942, 2167, 1371, 2571])
    
    In [70]: a.argsort()[:5]
    Out[70]: array([1647,  942, 2167, 1371, 2571])
    
    In [71]: %timeit np.argpartition(a,range(5))[:5]
    10000 loops, best of 3: 112 µs per loop
    
    In [72]: %timeit a.argsort()[:5]
    1000 loops, best of 3: 888 µs per loop
    

    Very useful here: about 8 times faster than the full argsort (112 µs vs. 888 µs).

  • 2020-12-05 14:44

    Given the task of indirectly sorting a subset (the top k, top meaning first in sort order), there are two builtin solutions: argsort and argpartition, cf. @Divakar's answer.

    If, however, performance is a consideration, then it may (depending on the sizes of the data and of the subset of interest) be well worth resisting the "lure of the one-liner", investing one more line and applying argsort to the output of argpartition:

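    >>> import timeit
    >>> import numpy as np
    >>> def top_k_sort(a, k):
    ...     return np.argsort(a)[:k]  # full sort, then keep the first k
    ...
    >>> def top_k_argp(a, k):
    ...     return np.argpartition(a, range(k))[:k]  # partition on every index 0..k-1
    ...
    >>> def top_k_hybrid(a, k):
    ...     b = np.argpartition(a, k)[:k]  # unordered indices of the k smallest
    ...     return b[np.argsort(a[b])]  # sort only those k
    ...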
    >>> k = 100
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k})
    8.348663672804832
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k})
    9.869098862167448
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k})
    1.2305558240041137
    

    argsort is O(n log n), argpartition with a range argument appears to be O(nk) (?), and argpartition + argsort is O(n + k log k).

    Therefore, in the interesting regime n >> k >> 1, the hybrid method is expected to be the fastest.
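
The timings above say nothing about whether the three functions agree, so here is a small consistency check as a sketch. It repeats the three definitions so that it runs on its own, and it assumes random input with an arbitrary seed. It compares the selected values, not the indices, so that ties in the data could not cause a spurious mismatch.

```python
import numpy as np

def top_k_sort(a, k):
    return np.argsort(a)[:k]

def top_k_argp(a, k):
    return np.argpartition(a, range(k))[:k]

def top_k_hybrid(a, k):
    b = np.argpartition(a, k)[:k]  # requires k < len(a)
    return b[np.argsort(a[b])]

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
a = rng.random(100_000)
k = 100

expected = a[top_k_sort(a, k)]  # the k smallest values in ascending order
agree = all(
    np.array_equal(expected, a[f(a, k)])
    for f in (top_k_argp, top_k_hybrid)
)
print(agree)
```

Note that the hybrid version calls np.argpartition(a, k), which needs k to be strictly smaller than the array length; for k equal to the length, a plain argsort is the simpler choice.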

    UPDATE: ND version:

    import numpy as np
    from timeit import timeit
    
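    # (*axis%A.ndim*(slice(None),),slice(k)) is an index tuple with axis%A.ndim
    # full slices followed by slice(k): it keeps the first k entries along `axis`.
    def top_k_sort(A,k,axis=-1):
        return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]
    
    def top_k_partition(A,k,axis=-1):
        return A.argpartition(range(k),axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]
    
    def top_k_hybrid(A,k,axis=-1):
        # unordered indices of the k smallest entries along `axis`
        B = A.argpartition(k,axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]
        # sort only those k entries and reorder the indices accordingly
        return np.take_along_axis(B,np.take_along_axis(A,B,axis).argsort(axis),axis)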
    
    A = np.random.random((100,10000))
    k = 100
    
    # number=10 runs, so total seconds * 100 is milliseconds per call
    for f in globals().copy():
        if f.startswith("top_"):
            print(f, timeit(f"{f}(A,k)",globals=globals(),number=10)*100)
    

    Sample run (milliseconds per call):

    top_k_sort 63.72379460372031
    top_k_partition 99.30561298970133
    top_k_hybrid 10.714635509066284
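
The index expression in the ND functions is compact but hard to read. As a sketch of the same idea written out more explicitly, the code below builds the index tuple with a small helper and checks the hybrid result against a full sort along several axes. The names first_k_along_axis and top_k_hybrid_nd are hypothetical and introduced only for this illustration; the array shape and seed are arbitrary.

```python
import numpy as np

def first_k_along_axis(X, k, axis=-1):
    # Hypothetical helper: keep the first k entries along `axis`, everything along the others.
    index = [slice(None)] * X.ndim
    index[axis] = slice(k)
    return X[tuple(index)]

def top_k_hybrid_nd(A, k, axis=-1):
    # Unordered indices of the k smallest entries along `axis` (requires k < A.shape[axis]).
    B = first_k_along_axis(np.argpartition(A, k, axis=axis), k, axis)
    # Sort only those k entries, then reorder the indices to match.
    order = np.argsort(np.take_along_axis(A, B, axis=axis), axis=axis)
    return np.take_along_axis(B, order, axis=axis)

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility only
A = rng.random((20, 50))
k = 5

checks = []
for axis in (0, 1, -1):
    got = np.take_along_axis(A, top_k_hybrid_nd(A, k, axis), axis=axis)
    want = first_k_along_axis(np.sort(A, axis=axis), k, axis)
    checks.append(np.array_equal(got, want))
print(all(checks))
```

Building the index as a list and assigning to index[axis] handles negative axes without the modulo trick, at the cost of a few extra lines.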
    
  • 2020-12-05 14:56

    Let's describe the partition method in a simplified way, which helps a lot in understanding argpartition.

    Following the example in the picture, if we execute C=numpy.argpartition(A, 3), C will be the array holding, for every element of the partitioned array B, its position in the original array A (X being the element that lands at position 3), i.e.:

    Idx(z) = index of element z in array A
    
    then C would be
    
    C = [ Idx(B[0]), Idx(B[1]), Idx(B[2]), Idx(X), Idx(B[4]), ..... Idx(B[N]) ]
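
The relation between A, B, C and X can be tried out on a small array. This is only a sketch: the data is made up (with distinct values so that Idx is well defined), B is taken to be A reordered by C, and idx is a hypothetical helper standing in for the Idx notation above.

```python
import numpy as np

A = np.array([9, 4, 7, 1, 8, 3, 6])  # made-up data with distinct values
C = np.argpartition(A, 3)             # indices into A, in partition order
B = A[C]                              # the partitioned values
X = B[3]                              # the element that lands in its sorted position

def idx(z):
    # Idx(z): position of z in A (unambiguous because the values are distinct)
    return int(np.flatnonzero(A == z)[0])

# C[i] is the position in A of the i-th element of B
mapping_holds = all(C[i] == idx(B[i]) for i in range(len(A)))

print(X, mapping_holds)
```

Here X is 6, the fourth smallest value of A. The three entries before it in B are the smaller values 1, 3 and 4, and the three after it are 7, 8 and 9, in each case in an order that is not guaranteed.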
    

    As previously mentioned, this method is very helpful and comes in very handy when you have a huge array and are only interested in a selected group of ordered elements, not the whole array.
