Can numpy's argsort give equal elements the same rank?

半阙折子戏 2021-02-07 12:04

I want to get the rank of each element, so I use argsort in numpy:

np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6         


        
4 answers
  •  野性不改
    2021-02-07 12:57

    With a focus on performance, here's an approach for the case where the input array is already sorted -

    import numpy as np

    def rank_repeat_based(arr):
        # idx: start index of each run of equal values, plus arr.size as the end marker
        idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
        return np.repeat(idx[:-1],np.diff(idx))
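
As a quick check, here is the sorted-input function applied to the array from the question. This is only a small sketch: the function body is repeated so the snippet runs on its own, and the variable names `arr` and `ranks` are illustrative.

```python
import numpy as np

def rank_repeat_based(arr):
    # Same function as in the answer, repeated so this snippet is self-contained.
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr)) + 1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))

# The (already sorted) array from the question.
arr = np.array((1, 1, 1, 2, 2, 3, 3, 3, 3))
ranks = rank_repeat_based(arr)
print(ranks.tolist())  # [0, 0, 0, 3, 3, 5, 5, 5, 5]
```

Each group of equal values receives the index at which the group starts, so ties share the smallest rank in the group (0-based).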
    

    For the generic case, where the elements of the input array are not already sorted, we need argsort() to keep track of the original positions. The modified version looks like this -

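    def rank_repeat_based_generic(arr):
        # Stable sort order, so tied elements keep their original relative order
        sidx = np.argsort(arr,kind='mergesort')
        # Run boundaries, computed on the sorted values
        idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
        # sidx.argsort() is the inverse permutation: it maps ranks back to input order
        return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]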
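
To see the generic version at work on unsorted input, the sketch below runs it on a small array and compares the result with a brute-force reference. The helper `rank_bruteforce` is hypothetical and not part of the answer; it is there only for checking, and it assumes the "min" rank of an element is the count of elements strictly smaller than it.

```python
import numpy as np

def rank_repeat_based_generic(arr):
    # Same function as in the answer, repeated so this snippet is self-contained.
    sidx = np.argsort(arr, kind='mergesort')
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr[sidx])) + 1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))[sidx.argsort()]

def rank_bruteforce(arr):
    # Hypothetical O(n^2) reference, for checking only: an element's
    # 0-based min rank is the number of elements strictly smaller than it.
    return (arr[:, None] > arr[None, :]).sum(axis=1)

arr = np.array([3, 1, 2, 1, 3, 3])
ranks = rank_repeat_based_generic(arr)
print(ranks.tolist())                 # [3, 0, 2, 0, 3, 3]
print(rank_bruteforce(arr).tolist())  # [3, 0, 2, 0, 3, 3]
```

The brute-force version builds an n-by-n comparison matrix, so it is only suitable for small test arrays.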
    

    Runtime test

    Timing all the approaches listed thus far on a larger dataset (rankdata is scipy.stats.rankdata; rankmin refers to a function from another answer, not shown here).

    Sorted array case:

    In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
    
    In [97]: %timeit rankdata(arr, method='min') - 1
    1000 loops, best of 3: 635 µs per loop
    
    In [98]: %timeit rankmin(arr)
    1000 loops, best of 3: 495 µs per loop
    
    In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
    1000 loops, best of 3: 826 µs per loop
    
    In [100]: %timeit rank_repeat_based(arr)
    10000 loops, best of 3: 200 µs per loop
    

    Unsorted case:

    In [106]: arr = np.random.randint(1,100,(10000))
    
    In [107]: %timeit rankdata(arr, method='min') - 1
    1000 loops, best of 3: 963 µs per loop
    
    In [108]: %timeit rankmin(arr)
    1000 loops, best of 3: 869 µs per loop
    
    In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
    1000 loops, best of 3: 1.17 ms per loop
    
    In [110]: %timeit rank_repeat_based_generic(arr)
    1000 loops, best of 3: 1.76 ms per loop
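
The timings above use IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The helper `best_time_us`, the fixed seed, and the repeat counts are assumptions made for illustration. Only the two functions from this answer are timed, and the numbers depend on the machine, so they are not expected to match the figures above.

```python
import timeit
import numpy as np

def rank_repeat_based(arr):
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr)) + 1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))

def rank_repeat_based_generic(arr):
    sidx = np.argsort(arr, kind='mergesort')
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr[sidx])) + 1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))[sidx.argsort()]

def best_time_us(func, arr, repeat=3, number=100):
    # Hypothetical helper: best per-call time in microseconds,
    # taken over `repeat` rounds of `number` calls each.
    times = timeit.repeat(lambda: func(arr), repeat=repeat, number=number)
    return min(times) / number * 1e6

rng = np.random.RandomState(0)  # fixed seed (an assumption) for repeatable data
unsorted_arr = rng.randint(1, 100, 10000)
sorted_arr = np.sort(unsorted_arr)

t_sorted = best_time_us(rank_repeat_based, sorted_arr)
t_generic = best_time_us(rank_repeat_based_generic, unsorted_arr)
print("sorted input,   rank_repeat_based:         %.1f us per call" % t_sorted)
print("unsorted input, rank_repeat_based_generic: %.1f us per call" % t_generic)
```

Taking the minimum over several rounds mirrors the "best of 3" reporting used by %timeit in the session above.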
    
