Can numpy's argsort give equal element the same rank?

前端 未结 4 1016
半阙折子戏
半阙折子戏 2021-02-07 12:04

I want to get the rank of each element, so I use argsort in numpy:

np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6         


        
相关标签:
4条回答
  • 2021-02-07 12:40

    If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':

    In [14]: a
    Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])
    
    In [15]: from scipy.stats import rankdata
    
    In [16]: rankdata(a, method='min')
    Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])
    

    Note that rankdata starts the ranks at 1. To start at 0, subtract 1 from the result:

    In [17]: rankdata(a, method='min') - 1
    Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
    

    If you don't want the scipy dependency, you can use numpy.unique to compute the ranking. Here's a function that computes the same result as rankdata(x, method='min') - 1:

    import numpy as np
    
    def rankmin(x):
        u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
        csum = np.zeros_like(counts)
        csum[1:] = counts[:-1].cumsum()
        return csum[inv]
    

    For example,

    In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])
    
    In [138]: rankdata(x, method='min') - 1
    Out[138]: array([6, 1, 0, 3, 2, 4, 5])
    
    In [139]: rankmin(x)
    Out[139]: array([6, 1, 0, 3, 2, 4, 5])
    
    In [140]: a = np.array([1,1,1,2,2,3,3,3,3])
    
    In [141]: rankdata(a, method='min') - 1
    Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
    
    In [142]: rankmin(a)
    Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
    

    By the way, a single call to argsort() does not give ranks. You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().

    0 讨论(0)
  • 2021-02-07 12:47

    I've written a function for the same purpose. It uses pure python and numpy only. Please have a look. I put comments as well.

    def my_argsort(array):
        # this type conversion let us work with python lists and pandas series
        array = np.array(array)
        # create mapping for unique values
        # it's a dictionary where keys are values from the array and
        # values are desired indices 
        unique_values = list(set(array))
        mapping = dict(zip(unique_values, np.argsort(unique_values)))
        # apply mapping to our array
        # np.vectorize works similar map(), and can work with dictionaries
        array = np.vectorize(mapping.get)(array)
        return array
    

    Hope that helps.

    0 讨论(0)
  • 2021-02-07 12:51

    Alternatively, pandas series has a rank method which does what you need with the min method:

    import pandas as pd
    pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")
    
    # 0    1
    # 1    1
    # 2    1
    # 3    4
    # 4    4
    # 5    6
    # 6    6
    # 7    6
    # 8    6
    # dtype: float64
    
    0 讨论(0)
  • 2021-02-07 12:57

    With focus on performance, here's an approach -

    def rank_repeat_based(arr):
        idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
        return np.repeat(idx[:-1],np.diff(idx))
    

    For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -

    def rank_repeat_based_generic(arr):    
        sidx = np.argsort(arr,kind='mergesort')
        idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
        return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]
    

    Runtime test

    Testing out all the approaches listed thus far to solve the problem on a large dataset.

    Sorted array case :

    In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
    
    In [97]: %timeit rankdata(arr, method='min') - 1
    1000 loops, best of 3: 635 µs per loop
    
    In [98]: %timeit rankmin(arr)
    1000 loops, best of 3: 495 µs per loop
    
    In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
    1000 loops, best of 3: 826 µs per loop
    
    In [100]: %timeit rank_repeat_based(arr)
    10000 loops, best of 3: 200 µs per loop
    

    Unsorted case :

    In [106]: arr = np.random.randint(1,100,(10000))
    
    In [107]: %timeit rankdata(arr, method='min') - 1
    1000 loops, best of 3: 963 µs per loop
    
    In [108]: %timeit rankmin(arr)
    1000 loops, best of 3: 869 µs per loop
    
    In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
    1000 loops, best of 3: 1.17 ms per loop
    
    In [110]: %timeit rank_repeat_based_generic(arr)
    1000 loops, best of 3: 1.76 ms per loop
    
    0 讨论(0)
提交回复
热议问题