Can numpy's argsort give equal element the same rank?

前端未结

关注

 4  1016

I want to get the rank of each element, so I use argsort in numpy:

np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  深忆病人        
                
              
                            
                2021-02-07 12:40
              
            
            
                                                                       
If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':

In [14]: a
Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])

In [15]: from scipy.stats import rankdata

In [16]: rankdata(a, method='min')
Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])


Note that rankdata starts the ranks at 1.  To start at 0, subtract 1 from the result:

In [17]: rankdata(a, method='min') - 1
Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])




If you don't want the scipy dependency, you can use numpy.unique to compute the ranking.  Here's a function that computes the same result as rankdata(x, method='min') - 1:

import numpy as np

def rankmin(x):
    u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    csum = np.zeros_like(counts)
    csum[1:] = counts[:-1].cumsum()
    return csum[inv]


For example,

In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])

In [138]: rankdata(x, method='min') - 1
Out[138]: array([6, 1, 0, 3, 2, 4, 5])

In [139]: rankmin(x)
Out[139]: array([6, 1, 0, 3, 2, 4, 5])

In [140]: a = np.array([1,1,1,2,2,3,3,3,3])

In [141]: rankdata(a, method='min') - 1
Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])

In [142]: rankmin(a)
Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])




By the way, a single call to argsort() does not give ranks.  You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  鱼传尺愫        
                
              
                            
                2021-02-07 12:47
              
            
            
                                                                       
I've written a function for the same purpose. It uses pure python and numpy only. Please have a look. I put comments as well.

def my_argsort(array):
    # this type conversion let us work with python lists and pandas series
    array = np.array(array)
    # create mapping for unique values
    # it's a dictionary where keys are values from the array and
    # values are desired indices 
    unique_values = list(set(array))
    mapping = dict(zip(unique_values, np.argsort(unique_values)))
    # apply mapping to our array
    # np.vectorize works similar map(), and can work with dictionaries
    array = np.vectorize(mapping.get)(array)
    return array


Hope that helps.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2021-02-07 12:51
              
            
            
                                                                       
Alternatively, pandas series has a rank method which does what you need with the min method:

import pandas as pd
pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")

# 0    1
# 1    1
# 2    1
# 3    4
# 4    4
# 5    6
# 6    6
# 7    6
# 8    6
# dtype: float64

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  野性不改        
                
              
                            
                2021-02-07 12:57
              
            
            
                                                                       
With focus on performance, here's an approach -

def rank_repeat_based(arr):
    idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
    return np.repeat(idx[:-1],np.diff(idx))


For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -

def rank_repeat_based_generic(arr):    
    sidx = np.argsort(arr,kind='mergesort')
    idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
    return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]


Runtime test

Testing out all the approaches listed thus far to solve the problem on a large dataset.

Sorted array case :

In [96]: arr = np.sort(np.random.randint(1,100,(10000)))

In [97]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 635 µs per loop

In [98]: %timeit rankmin(arr)
1000 loops, best of 3: 495 µs per loop

In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 826 µs per loop

In [100]: %timeit rank_repeat_based(arr)
10000 loops, best of 3: 200 µs per loop


Unsorted case :

In [106]: arr = np.random.randint(1,100,(10000))

In [107]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 963 µs per loop

In [108]: %timeit rankmin(arr)
1000 loops, best of 3: 869 µs per loop

In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 1.17 ms per loop

In [110]: %timeit rank_repeat_based_generic(arr)
1000 loops, best of 3: 1.76 ms per loop

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复