Unix sort treatment of underscore character

半阙折子戏 2020-12-30 01:16

I have two Linux machines on which Unix sort seems to behave differently. I believe I've narrowed it down to the treatment of the underscore character.

If I run

5 answers
  • 2020-12-30 01:25

    I really liked the answer with the useful sort --debug example. I'd just add one more string to its list to show how strange the sorting behavior can be:

    $ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar'; echo 'foobbar'; echo 'foobar') | LC_COLLATE=en_US.UTF-8 sort --debug
    sort: using ‘en_US.UTF-8’ sorting rules
    foo0bar
    _______
    fooabar
    _______
    fooAbar
    _______
    foobar
    ______
    foo_bar
    _______
    foobbar
    _______
    

    Seems crazy, right? I found the explanation here; in this case it is because the Unicode collation algorithm is used in this locale: https://unix.stackexchange.com/questions/252419/unexpected-sort-order-in-en-us-utf-8-locale

    However, even the sort --debug option cannot easily demonstrate the subtleties of the rules that strcoll() follows to obey the locale's sorting specification.

    POSIX stipulates that locale authors (for all but the C locale) have absolute control over all sorts of fiddly aspects of how strcoll() behaves, and the fact that two vendors declare that their locale is named en_US.UTF-8 does NOT imply/require those two vendors to have the same locale definition. So the collation rules between two different platforms are very likely different, based on whoever wrote the locale file for that platform, and what bug fixes have been incorporated into the locale definition over time.
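
One way to check whether two machines really collate differently is to sort the same sample on both and compare the results. The following is only a sketch: the helper name sample is hypothetical, and the strings are the ones from the example above.

```shell
# Hypothetical helper: emits the sample strings from the example, one per line.
sample() {
    printf '%s\n' foo_bar fooAbar foo0bar fooabar foobbar foobar
}

# Order under this machine's current locale settings (may differ per machine).
sample | sort | paste -s -d ' ' -

# Order under the C locale, which compares bytes and should match everywhere.
sample | LC_ALL=C sort | paste -s -d ' ' -
```

If the first line differs between the machines while the second is identical, the difference comes from the locale definitions rather than from sort itself.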

    Thank you Eric Blake at Red Hat for this insight.

  • 2020-12-30 01:31

    This is likely caused by a difference in locale. In the en_US.UTF-8 example below, the underscore (_) sorts after letters and digits, whereas in the POSIX C locale it sorts after uppercase letters and digits, but before lowercase letters.

    # LC_COLLATE=C applies only to this command; the shell's own setting is unchanged
    $ LC_COLLATE=C sort filename
    

    You can also use sort --debug to show more information about the sorting behavior in general:

    $ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') |
          LC_COLLATE=en_US.UTF-8 sort --debug
    sort: using ‘en_US.UTF-8’ sorting rules
    foo0bar
    fooabar
    fooAbar
    foo_bar
    
    $ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') | 
          LC_COLLATE=C sort --debug
    sort: using simple byte comparison
    foo0bar
    fooAbar
    foo_bar
    fooabar
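
The C-locale order shown above follows directly from the byte values of the characters. As a small sketch (the helper name byte_of is hypothetical), POSIX printf can print the numeric value of a character when the argument starts with a quote:

```shell
# Hypothetical helper: prints the numeric byte value of a single ASCII character.
# A leading quote in a numeric printf argument yields the value of the next character.
byte_of() {
    printf '%d\n' "'$1"
}

# The four characters that differ in the sample strings.
for c in 0 A _ a; do
    printf '%s = %s\n' "$c" "$(byte_of "$c")"
done
```

The digit has the lowest value, then the uppercase letter, then the underscore, then the lowercase letter, which is the order sort reports for "simple byte comparison".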
    

    As also shown in another answer, prefixing the command with LC_COLLATE=C, as in the sort filename example above, forces C collation for that single command without modifying your shell environment.

  • 2020-12-30 01:34

    The difference is due to your locale. Use the locale command to check the current settings.

    There are a number of different locale categories, such as LC_COLLATE, LC_TIME, and LC_MESSAGES. The environment variable LC_ALL overrides all of them, LANG supplies the default for any category that is not set individually, and LC_COLLATE changes only the collation (sort) order. The locale C or POSIX is a basic locale defined by the standard; others include en_US (US English), fr_FR (French), etc.
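
The precedence between these variables can be checked with a two-line input. This is a sketch under the assumption of a POSIX-conforming sort; the helper name two_lines is hypothetical, and en_US.UTF-8 appears only as a value that gets overridden, so it does not need to be installed.

```shell
# Hypothetical helper: two lines whose order is "B, a" in byte order.
two_lines() {
    printf '%s\n' a B
}

# LC_ALL wins over LC_COLLATE, so this sorts by byte value.
two_lines | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort

# With LC_ALL empty (treated as unset), LC_COLLATE wins over LANG.
two_lines | LC_ALL= LC_COLLATE=C LANG=en_US.UTF-8 sort
```

Both commands should print B before a, because in each case the C locale ends up controlling collation.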

  • 2020-12-30 01:37

    You can set LC_COLLATE to traditional sort order just for your command:

    env LC_COLLATE=C sort tmp
    

    This won't change the current environment, only the one in which the sort command executes. With this, you should get the same behaviour on both machines.
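
That the shell's own setting is untouched can be confirmed with a short check. This is a sketch only; the variable names before and after are illustrative, and the sample strings stand in for the file tmp.

```shell
# Record the shell's own LC_COLLATE ("unset" if the variable does not exist).
before=${LC_COLLATE-unset}

# The assignment is passed only to this one sort process.
printf '%s\n' foo_bar fooabar fooAbar | env LC_COLLATE=C sort

# Read the shell's setting again and compare.
after=${LC_COLLATE-unset}
if [ "$before" = "$after" ]; then
    echo "shell LC_COLLATE unchanged: $after"
fi
```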

  • 2020-12-30 01:48

    Sort order depends on the current value of the environment variable LC_COLLATE. Check your local documentation for 'locale', 'setlocale', etc. Set LC_COLLATE to 'POSIX' on both machines, and the results should match (this assumes LC_ALL is not set, since it takes precedence over LC_COLLATE).
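
To confirm that a file sorted on one machine is also considered sorted on the other, sort -c can be used. The following is a minimal sketch: the function name is_posix_sorted is hypothetical, and it sets LC_ALL rather than LC_COLLATE so that an existing LC_ALL cannot interfere.

```shell
# Hypothetical helper: succeeds if the file named by $1 is already in POSIX order.
# sort -c only checks the order and reports the result through its exit status.
is_posix_sorted() {
    LC_ALL=POSIX sort -c "$1" 2>/dev/null
}

# Sort a small sample under POSIX collation, then verify it the same way.
tmp=$(mktemp)
printf '%s\n' foo_bar fooAbar foo0bar fooabar | LC_ALL=POSIX sort > "$tmp"
if is_posix_sorted "$tmp"; then
    echo "sorted under POSIX collation"
fi
rm -f "$tmp"
```

Running the check on the second machine against a file produced on the first should succeed if both use POSIX collation.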
