Ranking of numpy array with possible duplicates

你的背包 2021-01-17 22:00

I have a numpy array of floats/ints and want to map its elements into their ranks.

If an array doesn't have duplicates the problem can be solved by the following co

3 Answers
  • 2021-01-17 22:27

    Here is a function that can return the output you desire (in the first case)

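    import numpy as np

    def argsortdup(a1):
      # Sort once; searchsorted then returns the leftmost index of each item.
      sorted_a = np.sort(a1)
      ranked = []
      for item in a1:
        ranked.append(sorted_a.searchsorted(item))
      return np.array(ranked)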
    

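    Basically, you sort the array and then search for the index at which each item sits in the sorted copy. When there are duplicates, searchsorted (with its default side='left') returns the index of the first instance, so tied values share the lowest rank. I tested it with your a2 example; doing something like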

    a3 = argsortdup(a2)
    

    Yields

    array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
    

    Test with a2:

    >>> a2
    array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
    >>> def argsortdup(a1):
    ...   sorted = sort(a1)
    ...   ranked = []
    ...   for item in a1:
    ...     ranked.append(sorted.searchsorted(item))
    ...   return array(ranked)
    ...
    >>> a3 = argsortdup(a2)
    >>> a2
    array([ 0.1,  1.1,  2.1,  3.1,  4.1,  1.1,  6.1,  7.1,  7.1,  1.1])
    >>> a3
    array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
    >>>
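
    The per-item loop above can probably be avoided, since np.searchsorted accepts a whole array of values. The sketch below is one way to do that; the function name rank_with_ties and the side switch are illustrative assumptions, not part of the original answer. It assumes a 1-D input and zero-based ranks.

```python
import numpy as np

def rank_with_ties(a, side="left"):
    """Zero-based ranks of a 1-D array; tied values share one rank.

    side="left"  -> every tie gets the lowest rank of its group
    side="right" -> every tie gets the highest rank of its group
    (hypothetical helper, named here only for illustration)
    """
    a = np.asarray(a)
    sorted_a = np.sort(a)
    # One vectorised lookup replaces the Python-level loop.
    ranks = np.searchsorted(sorted_a, a, side=side)
    # side="right" points one past the last equal element, so step back by 1.
    return ranks - 1 if side == "right" else ranks

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
min_ranks = rank_with_ties(a2)                 # lowest rank per tie group
max_ranks = rank_with_ties(a2, side="right")   # highest rank per tie group
print(min_ranks)
print(max_ranks)
```

    For a2 the side="left" result should match the array shown above, and side="right" gives the maximum-rank variant.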
    
  • 2021-01-17 22:41

    You can do reasonably well using unique and bincount. This version gives every tied value the maximum rank of its group:

    >>> u, v = np.unique(a2, return_inverse=True)
    >>> (np.cumsum(np.bincount(v)) - 1)[v]
    array([0, 3, 4, 5, 6, 3, 7, 9, 9, 3])
    

    Or, for the minimum rank:

    >>> (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
    array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
    

    There's a minor speedup by giving bincount the number of bins to provide:

    (np.cumsum(np.bincount(v, minlength=u.size)) - 1)[v]
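
    The one-liners above can be collected into a single helper, sketched here under a few assumptions: the name rank_unique and the method argument are hypothetical, the input is 1-D, and np.unique's return_counts option is used in place of a separate bincount call.

```python
import numpy as np

def rank_unique(a, method="min"):
    """Zero-based ranks via np.unique (hypothetical helper for illustration).

    method="min"   -> lowest rank in each tie group
    method="max"   -> highest rank in each tie group
    method="dense" -> consecutive ranks with no gaps after ties
    """
    a = np.asarray(a)
    # inverse[i] is the position of a[i] among the sorted unique values;
    # counts[k] is how many times the k-th unique value occurs.
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)  # one past the last slot of each tie group
    if method == "min":
        return (ends - counts)[inverse]
    if method == "max":
        return (ends - 1)[inverse]
    if method == "dense":
        return inverse
    raise ValueError("method must be 'min', 'max' or 'dense'")

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
print(rank_unique(a2, "min"))
print(rank_unique(a2, "max"))
print(rank_unique(a2, "dense"))
```

    The "min" and "max" results should agree with the two arrays printed earlier in this answer; "dense" is an extra variant that simply reuses the inverse index.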
    
  • 2021-01-17 22:44

    After upgrading to the latest version of scipy, as suggested by @WarrenWeckesser in the comments, scipy.stats.rankdata seems to be faster than both scipy.stats.mstats.rankdata and np.searchsorted, making it the fastest way to do this on larger arrays.

    In [1]: import numpy as np
    
    In [2]: from scipy.stats import rankdata as rd
       ...: from scipy.stats.mstats import rankdata as rd2
       ...: 
    
    In [3]: array = np.arange(0.1, 1000000.1)
    
    In [4]: %timeit np.searchsorted(np.sort(array), array)
    1 loops, best of 3: 385 ms per loop
    
    In [5]: %timeit rd(array)
    10 loops, best of 3: 109 ms per loop
    
    In [6]: %timeit rd2(array)
    1 loops, best of 3: 205 ms per loop
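
    The benchmark array has no duplicates, so it does not show how rankdata treats ties. A small sketch on the a2 array from the question, assuming a scipy version whose rankdata accepts the method argument: rankdata ranks start at 1, so 1 is subtracted to get zero-based ranks comparable with the other answers.

```python
import numpy as np
from scipy.stats import rankdata

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])

# rankdata is one-based; subtract 1 for zero-based ranks.
# astype(int) makes the integer dtype explicit regardless of scipy version.
min_ranks = rankdata(a2, method="min").astype(int) - 1
max_ranks = rankdata(a2, method="max").astype(int) - 1
# The default method is "average": ties get the mean of the ranks they span.
avg_ranks = rankdata(a2) - 1

print(min_ranks)
print(max_ranks)
print(avg_ranks)
```

    Which tie rule is appropriate depends on the use case; "min" reproduces the searchsorted result, while "max" reproduces the first unique/bincount result.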
    