Python: sort function breaks in the presence of nan

前端未结

关注

 6  1963

sorted([2, float(\'nan\'), 1]) returns [2, nan, 1]

(At least on Activestate Python 3.1 implementation.)

I understand nan


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  夕颜        
                
              
                            
                2020-11-29 07:42
              
            
            
                                                                       
Assuming you want to keep the NaNs and order them as the lowest "values", here is a workaround working both with non-unique nan, unique numpy nan, numerical and non numerical objects:

def is_nan(x):
    return (x is np.nan or x != x)

list_ = [2, float('nan'), 'z', 1, 'a', np.nan, 4, float('nan')]
sorted(list_, key = lambda x : float('-inf') if is_nan(x) else x)
# [nan, nan, nan, 1, 2, 4, 'a', 'z']

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2020-11-29 07:52
              
            
            
                                                                       
The problem is that there's no correct order if the list contains a NAN, since a sequence a1, a2, a3, ..., an is sorted if a1 <= a2 <= a3 <= ... <= an. If any of these a values is a NAN then the sorted property breaks, since for all a, a <= NAN and NAN <= a are both false.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  广开言路        
                
              
                            
                2020-11-29 07:55
              
            
            
                                                                       
The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values.  For example, less-than, a.k.a. operator <, could be used throughout if and only if less than defines a suitable ordering over the input values.

But this is specifically NOT true for floating point values and less-than:
"NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from GNU C manual, but applies to all modern IEEE754 based floating point)

So the possible solutions are:


  
  remove the NaNs first, making the input domain well defined via <
  (or the other sorting function being used)
  define a custom comparison function (a.k.a. predicate) that does
  define an ordering for NaN, such as less than any number, or greater
  than any number.
  


Either approach can be used, in any language.

Practically, considering python, I would prefer to remove the NaNs if you either don't care much about fastest performance or if removing NaNs is a desired behavior in context. 

Otherwise you could use a suitable predicate function via "cmp" in older python versions, or via this and  functools.cmp_to_key().  The latter is a bit more awkward, naturally, than removing the NaNs first. And care will be required to avoid worse performance, when defining this predicate function.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  逝去的感伤        
                
              
                            
                2020-11-29 07:56
              
            
            
                                                                       
IEEE754 is the standard that defines floating point operations in this instance. This standard defines the compare operation of operands, at least one of which is a NaN, to be an error.  Hence, this is not a bug.  You need to deal with the NaNs before operating on your array.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤独总比滥情好        
                
              
                            
                2020-11-29 08:01
              
            
            
                                                                       
I'm not sure about the bug, but the workaround may be the following:

sorted(
    (2, 1, float('nan')),
    lambda x,y: x is float('nan') and -1 
                or (y is float('nan') and 1
                or cmp(x,y)))


which results in:

('nan', 1, 2)


Or remove nans before sorting or anything else.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2020-11-29 08:01
              
            
            
                                                                       
Regardless of standards, there are many cases where a user-defined ordering of float and NA values is useful. For instance, I was sorting stock returns and wanted highest to lowest with NA last (since those were irrelevant). There are 4 possible combinations

Ascending float values, NA values last
Ascending float values, NA values first
Descending float values, NA values last
Descending float values, NA values first

Here is a function that covers all scenarios by conditionally replacing NA values with +/- inf
import math 

def sort_with_na(x, reverse=False, na_last=True):
    """Intelligently sort iterable with NA values

    For reliable behavior with NA values, we should change the NAs to +/- inf
    to guarantee their order rather than relying on the built-in
    ``sorted(reverse=True)`` which will have no effect. To use the ``reverse``
    parameter or other kwargs, use functools.partial in your lambda i.e.

        sorted(iterable, key=partial(sort_with_na, reverse=True, na_last=False))

    :param x: Element to be sorted
    :param bool na_last: Whether NA values should come last or first
    :param bool reverse: Return ascending if ``False`` else descending
    :return bool:
    """
    if not math.isnan(x):
        return -x if reverse else x
    else:
        return float('inf') if na_last else float('-inf')

Testing out each of the 4 combinations
from functools import partial

a = [2, float('nan'), 1]
sorted(a, key=sort_with_na)                                         # Default
sorted(a, key=partial(sort_with_na, reverse=False, na_last=True))   # Ascend, NA last
sorted(a, key=partial(sort_with_na, reverse=False, na_last=False))  # Ascend, NA first
sorted(a, key=partial(sort_with_na, reverse=True, na_last=True))    # Descend, NA last
sorted(a, key=partial(sort_with_na, reverse=True, na_last=False))   # Descend, NA first

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复