What's the difference between the --general-numeric-sort and --numeric-sort options in GNU sort?

前端未结


sort provides two kinds of numeric sort. This is from the man page:

   -g, --general-numeric-sort
          compare according to general numeric


                      
3 answers

栀梦, 2020-12-02 12:10

General numeric sort (-g) compares the numbers as floats. This allows scientific notation, e.g. 1.234E10, but it is slower and subject to rounding error (two values that differ only beyond the precision of the float compare as equal). Numeric sort (-n) is essentially a regular alphabetic sort that knows 10 comes after 9.
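A quick way to see the difference is to feed both modes the same three lines. This is only a sketch: it assumes GNU sort, pins LC_ALL=C so the decimal point is a dot, and the sample values are made up for illustration.

```shell
#!/bin/sh
# Sample input (illustrative values, not taken from the question).
sample='1e3
20
3'

# -n stops reading at the "e", so "1e3" is compared as 1.
by_n=$(printf '%s\n' "$sample" | LC_ALL=C sort -n)

# -g converts the whole numeric prefix as a float, so "1e3" is compared as 1000.
by_g=$(printf '%s\n' "$sample" | LC_ALL=C sort -g)

printf 'sort -n:\n%s\n\nsort -g:\n%s\n' "$by_n" "$by_g"
```

With this input, -n prints 1e3, 3, 20 and -g prints 3, 20, 1e3.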

See http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html


  ‘-g’ ‘--general-numeric-sort’ ‘--sort=general-numeric’
  Sort numerically, using the standard C function strtod to convert a
  prefix of each line to a double-precision floating point number. This
  allows floating point numbers to be specified in scientific notation,
  like 1.0e-34 and 10e100. The LC_NUMERIC locale determines the
  decimal-point character. Do not report overflow, underflow, or
  conversion errors. Use the following collating sequence:

    * Lines that do not start with numbers (all considered to be equal).
    * NaNs (“Not a Number” values, in IEEE floating point arithmetic) in
      a consistent but machine-dependent order.
    * Minus infinity.
    * Finite numbers in ascending numeric order (with -0 and +0 equal).
    * Plus infinity.

  Use this option only if there is no alternative; it is much slower than
  --numeric-sort (-n) and it can lose information when converting to
  floating point.

  ‘-n’ ‘--numeric-sort’ ‘--sort=numeric’
  Sort numerically. The number begins each line and consists of optional
  blanks, an optional ‘-’ sign, and zero or more digits possibly separated
  by thousands separators, optionally followed by a decimal-point
  character and zero or more digits. An empty number is treated as ‘0’.
  The LC_NUMERIC locale specifies the decimal-point character and
  thousands separator. By default a blank is a space or a tab, but the
  LC_CTYPE locale can change this.

  Comparison is exact; there is no rounding error.

  Neither a leading ‘+’ nor exponential notation is recognized. To compare
  such strings numerically, use the --general-numeric-sort (-g) option.
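The "can lose information" remark in the manual can be made visible with numbers longer than a float can hold. The sketch below is an assumption-laden illustration, not something from the manual: it assumes GNU sort, uses made-up 50-digit integers, and adds -s so that equal keys are not silently reordered by the whole-line fallback comparison.

```shell
#!/bin/sh
# Build two 50-digit integers that differ only in the last digit
# (made-up data; 50 digits is more than common floating-point types hold).
zeros=$(printf '%048d' 0)
big=$(printf '1%s2\n1%s1' "$zeros" "$zeros")

# -s turns off the last-resort whole-line comparison, so lines whose keys
# compare equal keep their input order.
exact=$(printf '%s\n' "$big" | LC_ALL=C sort -s -n)
lossy=$(printf '%s\n' "$big" | LC_ALL=C sort -s -g)

printf 'sort -s -n:\n%s\n\nsort -s -g:\n%s\n' "$exact" "$lossy"
```

Here -n should put the line ending in 1 first, because it compares the digits exactly, while -g should see two equal floats and leave the input order unchanged.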

                                                                        
                                                        
            
            
              
                
天涯浪人, 2020-12-02 12:11

You should be careful with your locale. For example, you might intend to sort a floating-point number written with a dot (like 2.2), whereas your locale might expect a comma as the decimal point (like 2,2).

As reported in this forum, you may get wrong results with either the -n or the -g flag.

In my case I use:

LC_ALL=C sort -k 6,6n file


to sort on the 6th column, which contains:

2.5
3.7
1.4


and obtain:

1.4
2.5
3.7
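For anyone who wants to try this without a real file, here is a small self-contained sketch. The six-column rows and the r1/r2/r3 labels are hypothetical; only the sort invocation comes from the answer above.

```shell
#!/bin/sh
# Hypothetical six-column input; only the last column matters here.
rows='r1 a b c d 2.5
r2 a b c d 3.7
r3 a b c d 1.4'

# -k 6,6n: the key starts and ends at field 6 and is compared numerically.
# LC_ALL=C makes "." the decimal point regardless of the user's locale.
sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 6,6n)

printf '%s\n' "$sorted"
```

The rows come out in the order r3, r1, r2, i.e. 1.4, 2.5, 3.7 in the sixth column.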

                                                                        
                                                        
            
            
              
                
余生分开走, 2020-12-02 12:14

In addition to the accepted answer, which mentions that -g allows scientific notation, I want to show the part that is most likely to cause undesirable behavior.

With -g:

$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -g myfile
baa
--inf
--inf  
--inf- 
--inf--
--inf-a
--nnf
nnf--
   nnn  
tnan
zoo
   naN
Nana
nani lol
-inf
-inf--
-11
-2
-1
1
+1
2
+2
0xa
11
+11
inf


Look at the lines after zoo; three things matter here:

1. Lines that start with NAN (e.g. Nana and nani lol) or -INF (a single dash, not --INF) move to the end of the non-numeric lines, just before the digits, while INF moves to the very end, after the digits, because it means infinity.
2. NAN, INF, and -INF are case-insensitive.
3. Whitespace on either side of NAN, INF, and -INF is always ignored (regardless of LC_CTYPE). For other alphabetic lines, whether whitespace on either side is ignored depends on LC_COLLATE (e.g. LC_COLLATE=fr_FR.UTF-8 ignores it but LC_COLLATE=us_EN.UTF-8 does not).
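The placement of NAN and INF described above can be reproduced with a much smaller input. This sketch assumes GNU sort and uses LC_ALL=C, so the tie-breaking among non-numeric lines is plain byte order rather than the fr_FR collation used in the listing; the six sample lines are made up.

```shell
#!/bin/sh
# Made-up mix of text, special float names, and ordinary numbers.
mixed='inf
5
zoo
-inf
nan
-3'

# -g: non-numbers first, then NaN, minus infinity, finite numbers, plus infinity.
by_g=$(printf '%s\n' "$mixed" | LC_ALL=C sort -g)

# -n: "inf", "-inf", "nan" and "zoo" are all empty numbers, i.e. 0, so they
# sit between the negative and the positive numbers.
by_n=$(printf '%s\n' "$mixed" | LC_ALL=C sort -n)

printf 'sort -g:\n%s\n\nsort -n:\n%s\n' "$by_g" "$by_n"
```

With -g the order is zoo, nan, -inf, -3, 5, inf. With -n it is -3, -inf, inf, nan, zoo, 5.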


So if you are sorting arbitrary alphanumeric data, you probably don't want -g. If you really need scientific-notation comparison with -g, you probably want to separate the alphabetic and numeric data and sort each part on its own.
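One way to do that split, as a rough sketch: send lines that look numeric through sort -g and everything else through a plain sort. The rule "starts with an optional sign and a digit" is my own assumption about what counts as numeric (it misses forms like .5), and the sample data is made up.

```shell
#!/bin/sh
# Made-up mixed input.
data='zoo
1e3
20
apple
3'

# Assumption: a line is numeric if it starts with an optional sign and a digit.
is_numeric='^[+-]?[0-9]'

# Numeric lines go through -g so that 1e3 is treated as 1000.
numeric=$(printf '%s\n' "$data" | grep -E "$is_numeric" | LC_ALL=C sort -g)

# Everything else gets an ordinary byte-order sort.
text=$(printf '%s\n' "$data" | grep -Ev "$is_numeric" | LC_ALL=C sort)

printf '%s\n%s\n' "$text" "$numeric"
```

This prints apple, zoo, 3, 20, 1e3, so words like nan or inf in the text part can no longer be mistaken for floats.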

If you only need to sort ordinary numbers (e.g. 1, -1) and do not care about 0x, E, or + notation, -n is enough:

$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
-1000
-22
-13
-11
-010
-10
-5
-2
-1
-0.2
-0.12
-0.11
-0.1
0x1
0x11
0xb
+1
+11
+2
-a
-aa
--aa
-aaa
-b
baa
BAA
bbb
+ignore
inf
-inf
--inf
--inf  
--inf- 
--inf--
-inf--
--inf-a
   naN
Nana
nani lol
--nnf
nnf--
   nnn  
None         
uum
Zero cool
-zzz
1
1.1
1.234E10
5
11


With either -g or -n, be aware of the effect of the locale. You may want to set LC_NUMERIC to en_US.UTF-8, because with fr_FR.UTF-8 negative floating-point numbers are sorted incorrectly:

$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=fr_FR.UTF-8 sort -n myfile
-10
-5
-2
-1
-1.1
-1.2
-0.1
-0.11
-0.12
-0.2
-a
+b
middle
-wwe
+zoo
1
1.1


With LC_NUMERIC=en_US.UTF-8:

$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
-10
-5
-2
-1.2
-1.1
-1
-0.2
-0.12
-0.11
-0.1
-a
+b
middle
-wwe
+zoo
1
1.1


Or use LC_NUMERIC=us_EN.UTF-8 to group lines that start with +, -, or a space with the alphabetic lines:

$ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=us_EN.UTF-8 sort -n myfile
-0.1
    a
    b
 a
 b
+b
+zoo
-a
-wwe
middle
1


You probably want to specify the locale explicitly when using sort if you want to write a portable script.
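As a sketch of what that can look like in a script: pin the locale on the sort command itself, and normalise the decimal separator first if the data was written with commas. The example uses only the C locale so it does not depend on fr_FR being installed; the comma-separated values are made up.

```shell
#!/bin/sh
# Made-up data written with decimal commas, as a fr_FR-style tool might emit.
comma_data='1,5
1,2
1,1'

# In the C locale "," is not part of a number, so every key is read as 1
# and -s leaves the input order untouched.
unsorted=$(printf '%s\n' "$comma_data" | LC_ALL=C sort -s -n)

# Normalising the separator first gives the script one known format to sort.
sorted=$(printf '%s\n' "$comma_data" | tr ',' '.' | LC_ALL=C sort -n)

printf 'as-is:\n%s\n\nnormalised:\n%s\n' "$unsorted" "$sorted"
```

The first result keeps the original order 1,5 / 1,2 / 1,1, while the normalised one comes out as 1.1, 1.2, 1.5 on any machine, whatever the user's locale is.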
                                                                        
                                                        
            
            
              
                