What's the difference between the --general-numeric-sort and --numeric-sort options in GNU sort?

野趣味 2020-12-02 11:55

sort provides two kinds of numeric sort. This is from the man page:

   -g, --general-numeric-sort
          compare according to general numeric         


        
3 answers
  • 2020-12-02 12:10

    General numeric sort compares the numbers as floats. This allows scientific notation, e.g. 1.234E10, but it is slower and subject to rounding error (1.2345678 could come after 1.2345679). Numeric sort is just a regular alphabetic sort that knows 10 comes after 9.

    See http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html

    ‘-g’ ‘--general-numeric-sort’ ‘--sort=general-numeric’ Sort numerically, using the standard C function strtod to convert a prefix of each line to a double-precision floating point number. This allows floating point numbers to be specified in scientific notation, like 1.0e-34 and 10e100. The LC_NUMERIC locale determines the decimal-point character. Do not report overflow, underflow, or conversion errors. Use the following collating sequence:

    • Lines that do not start with numbers (all considered to be equal).

    • NaNs (“Not a Number” values, in IEEE floating point arithmetic) in a consistent but machine-dependent order.

    • Minus infinity.

    • Finite numbers in ascending numeric order (with -0 and +0 equal).

    • Plus infinity.

    Use this option only if there is no alternative; it is much slower than --numeric-sort (-n) and it can lose information when converting to floating point.

    ‘-n’ ‘--numeric-sort’ ‘--sort=numeric’ Sort numerically. The number begins each line and consists of optional blanks, an optional ‘-’ sign, and zero or more digits possibly separated by thousands separators, optionally followed by a decimal-point character and zero or more digits. An empty number is treated as ‘0’. The LC_NUMERIC locale specifies the decimal-point character and thousands separator. By default a blank is a space or a tab, but the LC_CTYPE locale can change this.

    Comparison is exact; there is no rounding error.

    Neither a leading ‘+’ nor exponential notation is recognized. To compare such strings numerically, use the --general-numeric-sort (-g) option.
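
The difference described above can be seen on a tiny input. This is a minimal sketch, assuming GNU sort and a POSIX shell; the variable names `data`, `numeric`, and `general` are made up for illustration, and `LC_ALL=C` is set only to keep the result independent of the local environment.

```shell
# Sample input: plain integers plus one value in scientific notation.
data=$(printf '%s\n' 1e3 5 20)

# -n stops reading at the "e", so 1e3 is compared as 1 and sorts first.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -n)

# -g hands the line to strtod, so 1e3 is compared as 1000 and sorts last.
general=$(printf '%s\n' "$data" | LC_ALL=C sort -g)

printf -- '-n:\n%s\n-g:\n%s\n' "$numeric" "$general"
```

With -n the order is 1e3, 5, 20; with -g it is 5, 20, 1e3.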

  • 2020-12-02 12:11

    You should be careful with your locale. For example, you might intend to sort floating-point numbers written with a dot (like 2.2), whereas your locale might expect a comma as the decimal separator (like 2,2).

    As reported in this forum, you may get wrong results with the -n or -g flags.

    In my case I use:

    LC_ALL=C sort -k 6,6n file
    

    in order to sort the 6th column that contains:

    2.5
    3.7
    1.4
    

    in order to obtain

    1.4
    2.5
    3.7
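
For a self-contained version of the command above, here is a sketch with made-up six-column rows (the first five columns are placeholders, not data from the original file). It assumes GNU sort and awk are available; `rows` and `sorted` are illustrative names.

```shell
# Hypothetical six-column input; only the last column matters here.
rows=$(printf '%s\n' 'a b c d e 2.5' 'a b c d e 3.7' 'a b c d e 1.4' 'a b c d e 10.1')

# LC_ALL=C makes "." the decimal point regardless of the user's environment.
# -k 6,6n restricts the sort key to field 6 and compares it numerically.
sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort -k 6,6n)

# Print only the sixth column of the sorted result.
printf '%s\n' "$sorted" | awk '{ print $6 }'
```

The sixth column comes out as 1.4, 2.5, 3.7, 10.1, so 10.1 is placed by value rather than by its leading character.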
    
  • 2020-12-02 12:14

    In addition to the accepted answer, which mentions that -g allows scientific notation, I want to show the part that most likely causes undesirable behavior.

    With -g:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -g myfile
    baa
    --inf
    --inf  
    --inf- 
    --inf--
    --inf-a
    --nnf
    nnf--
       nnn  
    tnan
    zoo
       naN
    Nana
    nani lol
    -inf
    -inf--
    -11
    -2
    -1
    1
    +1
    2
    +2
    0xa
    11
    +11
    inf
    

    Look at the zoo. Three things are important here:

    • Lines that start with NAN (e.g. Nana and nani lol) or -INF (single dash, not --INF) move toward the end, but stay before the digits, while INF moves to the very end, after the digits, because it means infinity.

    • NAN, INF, and -INF are matched case-insensitively.

    • Whitespace on either side of NAN, INF, and -INF is always ignored (regardless of LC_CTYPE). For other alphabetic lines, whether surrounding whitespace is ignored depends on the LC_COLLATE locale (e.g. LC_COLLATE=fr_FR.UTF-8 ignores it but LC_COLLATE=us_EN.UTF-8 does not).
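
The placement of these special values can be reproduced with a much smaller input. The sketch below assumes GNU sort on a system whose strtod accepts "nan" and "inf" spellings (as glibc does); the variable name `special` is illustrative.

```shell
# Mixed input: a plain word, NaN, both infinities, and ordinary integers.
# -g orders them as: non-numbers, NaN, minus infinity, finite numbers, plus infinity.
special=$(printf '%s\n' INF 3 foo -inf NaN -7 | LC_ALL=C sort -g)

printf '%s\n' "$special"
```

Under those assumptions the lines come out as foo, NaN, -inf, -7, 3, INF, matching the collating sequence quoted from the manual.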

    So if you are sorting arbitrary alphanumeric data, you probably don't want -g. If you really need scientific-notation comparison with -g, you probably want to separate the alphabetic and numeric data and sort each part on its own.
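
One possible way to do that split is sketched below. The function name `sort_mixed` and the regular expression are assumptions made for this example, not part of sort; the pattern only accepts a sign, digits with an optional fraction, and an optional exponent, so words such as "inf" or "nani lol" are treated as text.

```shell
# Hypothetical helper: reads lines on stdin, prints the text lines sorted
# bytewise first, then the number-like lines sorted with -g.
sort_mixed() {
    # Assumed definition of "number-like": optional sign, digits with an
    # optional fraction, optional exponent, optional surrounding blanks.
    num_re='^[[:space:]]*[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?[[:space:]]*$'
    input=$(cat)
    # Lines that do not look like numbers: plain sort in the C locale.
    printf '%s\n' "$input" | grep -Ev "$num_re" | LC_ALL=C sort
    # Lines that look like numbers: general numeric sort.
    printf '%s\n' "$input" | grep -E "$num_re" | LC_ALL=C sort -g
}

printf '%s\n' 'nani lol' 1e3 -inf 5 zoo | sort_mixed
```

Here -inf, nani lol, and zoo are kept together as text, followed by 5 and 1e3 in numeric order. Whether text should come before or after the numbers is a design choice; swap the two pipelines to reverse it.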

    If you only need to sort ordinary numbers (e.g. 1, -1) and don't care about 0x, E, or + prefixes, -n is enough:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
    -1000
    -22
    -13
    -11
    -010
    -10
    -5
    -2
    -1
    -0.2
    -0.12
    -0.11
    -0.1
    0x1
    0x11
    0xb
    +1
    +11
    +2
    -a
    -aa
    --aa
    -aaa
    -b
    baa
    BAA
    bbb
    +ignore
    inf
    -inf
    --inf
    --inf  
    --inf- 
    --inf--
    -inf--
    --inf-a
       naN
    Nana
    nani lol
    --nnf
    nnf--
       nnn  
    None         
    uum
    Zero cool
    -zzz
    1
    1.1
    1.234E10
    5
    11
    

    With either -g or -n, be aware of the locale's effect. You may want to set LC_NUMERIC to en_US.UTF-8, because with fr_FR.UTF-8 sorting negative floating-point numbers fails:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=fr_FR.UTF-8 sort -n myfile
    -10
    -5
    -2
    -1
    -1.1
    -1.2
    -0.1
    -0.11
    -0.12
    -0.2
    -a
    +b
    middle
    -wwe
    +zoo
    1
    1.1
    

    With LC_NUMERIC=en_US.UTF-8:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
    -10
    -5
    -2
    -1.2
    -1.1
    -1
    -0.2
    -0.12
    -0.11
    -0.1
    -a
    +b
    middle
    -wwe
    +zoo
    1
    1.1
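
The two listings differ because the decimal-point character comes from LC_NUMERIC: when the locale expects a comma, -n presumably stops reading at the ".", so -1, -1.1, and -1.2 all compare as -1 and their relative order is left to the fallback comparison. A minimal sketch that pins the locale so "." is always the decimal point (assuming GNU sort; `values` and `by_value` are illustrative names):

```shell
# Negative and positive decimals written with "." as the decimal point.
values=$(printf '%s\n' -1 -1.1 -1.2 -0.1 -0.2 1 1.1)

# LC_ALL=C overrides LC_NUMERIC and LC_COLLATE, so the fraction is always parsed.
by_value=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

printf '%s\n' "$by_value"
```

This prints -1.2, -1.1, -1, -0.2, -0.1, 1, 1.1, which is the true numeric order.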
    

    Or use LC_NUMERIC=us_EN.UTF-8 to group lines that start with +, -, or a space together with the alphabetic lines:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=us_EN.UTF-8 sort -n myfile
    -0.1
        a
        b
     a
     b
    +b
    +zoo
    -a
    -wwe
    middle
    1
    

    You probably want to specify the locale explicitly when using sort if you want to write a portable script.
