Why doesn't ICU4J match UTF-8 sort order?

我在风中等你 2021-01-26 15:42

I am having a hard time understanding Unicode sort order.

When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a

2 Answers
  •  隐瞒了意图╮
    2021-01-26 16:37

    First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.

    Now, the page you linked to shows everything in numerical code point order. That is the order you get from a binary collation (in SQL Server, collations with names ending in _BIN or _BIN2). Non-binary ordering is far more complex; its rules are described in the Unicode Collation Algorithm (UCA).
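To make the distinction concrete, here is a minimal sketch using only the JDK (no ICU4J) that compares the two characters from the question by code point, which is what a binary collation amounts to. The class name `CodePointOrder` and the helper `compareByCodePoint` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodePointOrder {
    // Compare two strings code point by code point, the way a binary
    // collation would. No collation rules are involved here.
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        // UTF-8 only decides how a code point is stored as bytes.
        byte[] lowLine = "_".getBytes(StandardCharsets.UTF_8);
        byte[] numberSign = "#".getBytes(StandardCharsets.UTF_8);
        System.out.printf("_ is U+%04X, stored as 0x%02X%n", "_".codePointAt(0), lowLine[0]);
        System.out.printf("# is U+%04X, stored as 0x%02X%n", "#".codePointAt(0), numberSign[0]);

        // In code point order, "_" (U+005F) comes after "#" (U+0023).
        System.out.println(compareByCodePoint("_", "#") > 0); // prints "true"
    }
}
```

A collator is not required to agree with this result, because it compares collation weights rather than code points.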

    The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt

    It shows:

    005F  ; [*010A.0020.0002] # LOW LINE
    ...
    0023  ; [*0290.0020.0002] # NUMBER SIGN
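As a rough illustration of how those two entries decide the outcome, the sketch below parses the lines quoted above and compares their primary weights (the first of the three hexadecimal fields). It is not how ICU4J is implemented: the class `PrimaryWeightDemo` and its helpers are hypothetical, the parser only handles entries with a single code point and a single collation element, and it assumes variable elements (the ones marked with `*`) are not ignored at the primary level.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrimaryWeightDemo {
    // Matches simple entries such as "005F  ; [*010A.0020.0002] # LOW LINE".
    // Group 1 is the code point; groups 2 to 4 are the primary, secondary
    // and tertiary weights.
    private static final Pattern ENTRY = Pattern.compile(
        "^\\s*([0-9A-Fa-f]{4,6})\\s*;\\s*\\[[.*]"
        + "([0-9A-Fa-f]{4})\\.([0-9A-Fa-f]{4})\\.([0-9A-Fa-f]{4})\\]");

    private static Matcher parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized entry: " + line);
        }
        return m;
    }

    static int codePoint(String line) {
        return Integer.parseInt(parse(line).group(1), 16);
    }

    static int primaryWeight(String line) {
        return Integer.parseInt(parse(line).group(2), 16);
    }

    public static void main(String[] args) {
        String lowLine = "005F  ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023  ; [*0290.0020.0002] # NUMBER SIGN";

        // Code point order: U+005F is greater than U+0023.
        System.out.println(codePoint(lowLine) > codePoint(numberSign)); // prints "true"

        // Primary weight order: 010A is less than 0290, so under these base
        // rules LOW LINE sorts before NUMBER SIGN.
        System.out.println(primaryWeight(lowLine) < primaryWeight(numberSign)); // prints "true"
    }
}
```

The two comparisons point in opposite directions, which is why a collator built on these weights does not agree with the code point chart.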
    

    Keep in mind that any locale / culture can override these base rules. So while the few lines above explain this specific case, for other cases you would also need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ for locale-specific overrides.
