Why doesn't ICU4J match UTF-8 sort order?

我在风中等你 2021-01-26 15:42

I am having a hard time understanding unicode sorting order.

When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a

2 Answers
  • 2021-01-26 16:37

    First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.

    Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
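To make the "numerical Code Point order" concrete, here is a minimal sketch using only the JDK. The class and method names (CodePointOrderDemo, compareByCodePoint) are made up for illustration. It shows that UTF-8 only determines how the characters are stored, and that a plain code point comparison puts # (U+0023) before _ (U+005F).

```java
import java.nio.charset.StandardCharsets;

class CodePointOrderDemo {
    // Hypothetical helper: compares the first code point of each string numerically.
    // This is "binary" order and involves no collation rules at all.
    static int compareByCodePoint(String a, String b) {
        return Integer.compare(a.codePointAt(0), b.codePointAt(0));
    }

    public static void main(String[] args) {
        String lowLine = "_";     // U+005F
        String numberSign = "#";  // U+0023

        // UTF-8 is only the storage format: each of these characters is one byte.
        byte[] utf8 = lowLine.getBytes(StandardCharsets.UTF_8);
        System.out.printf("U+%04X is stored as %d UTF-8 byte(s)%n",
                lowLine.codePointAt(0), utf8.length);

        // Positive result: 0x5F is numerically greater than 0x23.
        System.out.println(compareByCodePoint(lowLine, numberSign) > 0); // true
        // String.compareTo compares UTF-16 code units, which agrees here.
        System.out.println(lowLine.compareTo(numberSign) > 0);           // true
    }
}
```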

    The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt

    It shows:

    005F  ; [*010A.0020.0002] # LOW LINE
    ...
    0023  ; [*0290.0020.0002] # NUMBER SIGN
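As a rough illustration of how to read those two entries, the sketch below pulls the first (primary) weight out of each quoted line and compares them. The class and method names are hypothetical, and it handles only the simple single-element form shown here. A real implementation of the algorithm builds full sort keys across all levels and also deals with variable weighting (the * marker), contractions and so on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllKeysLineDemo {
    // Matches lines such as "005F  ; [*010A.0020.0002] # LOW LINE".
    // Group 1 is the code point, group 2 the primary weight (both hex).
    private static final Pattern LINE = Pattern.compile(
            "^\\s*([0-9A-Fa-f]{4,6})\\s*;\\s*\\[[.*]([0-9A-Fa-f]{4})\\.");

    // Returns the primary weight of the first collation element on the line.
    static int primaryWeight(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return Integer.parseInt(m.group(2), 16);
    }

    public static void main(String[] args) {
        String lowLine = "005F  ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023  ; [*0290.0020.0002] # NUMBER SIGN";

        // 0x010A < 0x0290, so LOW LINE sorts before NUMBER SIGN at the primary level.
        System.out.println(primaryWeight(lowLine) < primaryWeight(numberSign)); // true
    }
}
```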
    

    It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific case, for other cases you would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see whether there are any locale-specific overrides.
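To show that the order really does come from rules rather than from the characters themselves, here is a small sketch with a hand-written rule set. It is not an actual CLDR locale tailoring, and it uses the JDK's java.text.RuleBasedCollator as a stand-in for ICU4J's collator. The class and method names are invented for the example. The rule string deliberately puts # before _, the opposite of the base order quoted above.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class TailoringDemo {
    // Hypothetical helper: builds a collator whose only rules are "# then _".
    // Punctuation must be quoted with single quotes in the JDK rule syntax.
    static RuleBasedCollator numberSignFirst() {
        try {
            return new RuleBasedCollator("< '#' < '_'");
        } catch (ParseException e) {
            // The rule string is a constant, so a parse failure is a programming error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator custom = numberSignFirst();
        // Negative result: under these custom rules "#" sorts before "_".
        System.out.println(custom.compare("#", "_") < 0); // true
    }
}
```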

  • 2021-01-26 16:41

    Converting Mark Ransom's comments into an answer:

    • The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
    • If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
    • In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.
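The difference between the two orderings can be seen by sorting the same list twice. This sketch uses the JDK's java.text.Collator as a stand-in for com.ibm.icu.text.Collator, which is an assumption on my part. The two implementations use different rule tables, so the collated result may not match ICU4J exactly, and the class and method names are made up for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SortOrderDemo {
    // Sorts by String.compareTo, i.e. by UTF-16 code unit values.
    static List<String> sortedByCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts using the collation rules of the given locale.
    static List<String> sortedByCollator(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> symbols = Arrays.asList("_", "#");
        // Code unit order: "#" (0x23) comes before "_" (0x5F).
        System.out.println(sortedByCodeUnit(symbols));
        // Collation order depends on the collator's rule table, not on code points.
        System.out.println(sortedByCollator(symbols, Locale.ENGLISH));
    }
}
```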