I am having a hard time understanding Unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1, I get a return value of -1, indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical code point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described in the Unicode Collation Algorithm (UCA).
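To make the "binary" ordering concrete, here is a minimal sketch in plain Java (9 or later) that compares two strings by code point value and by their raw UTF-8 bytes. The class and method names are made up for illustration. Neither comparison consults any collation table, which is why both put # ahead of _.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: binary (non-linguistic) string comparison.
final class BinaryOrderDemo {

    // Compares by Unicode code point value, element by element.
    static int compareByCodePoint(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        return Arrays.compare(ca, cb);
    }

    // Compares the UTF-8 encoded bytes, treating each byte as unsigned.
    static int compareUtf8Bytes(String a, String b) {
        byte[] ba = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(ba, bb);
    }

    public static void main(String[] args) {
        // '_' is U+005F and '#' is U+0023, so '_' sorts after '#' in binary order.
        System.out.println(compareByCodePoint("_", "#") > 0); // prints "true"
        System.out.println(compareUtf8Bytes("_", "#") > 0);   // prints "true"
    }
}
```

This is the ordering the linked chart reflects. A collator such as ICU4J's does not work this way.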
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific case, for other cases you would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see whether any locale-specific overrides apply.
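As a rough illustration of how the two table entries quoted above decide the result, the following sketch parses each line and compares only the first (primary) weight inside the brackets. It is not how ICU4J is implemented. The class and method names are hypothetical, the parser handles only single-code-point entries with one collation element, and it assumes no locale override and that the secondary and tertiary weights do not matter (they are identical here).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: reads the primary weight from an allkeys-style line.
final class AllKeysDemo {

    // Matches lines such as "005F ; [*010A.0020.0002] # LOW LINE".
    // Group 1 is the code point, group 2 is the primary weight (both hex).
    private static final Pattern ENTRY = Pattern.compile(
        "([0-9A-F]{4,6})\\s*;\\s*\\[[.*]([0-9A-F]{4})\\.[0-9A-F]{4}\\.[0-9A-F]{4}\\]");

    static int primaryWeight(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        if (!m.lookingAt()) {
            throw new IllegalArgumentException("Unsupported entry: " + line);
        }
        return Integer.parseInt(m.group(2), 16);
    }

    // Orders two entries by primary weight only.
    static int compareByPrimary(String lineA, String lineB) {
        return Integer.compare(primaryWeight(lineA), primaryWeight(lineB));
    }

    public static void main(String[] args) {
        String lowLine = "005F ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023 ; [*0290.0020.0002] # NUMBER SIGN";
        // 0x010A is less than 0x0290, so LOW LINE sorts first.
        System.out.println(compareByPrimary(lowLine, numberSign)); // prints "-1"
    }
}
```

The -1 here lines up with the value ICU4J returned in the question, even though U+005F is numerically larger than U+0023.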
Converting Mark Ransom's comments into an answer:
- The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
- If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
- In particular, _ is
  005F ; [*020B.0020.0002] # LOW LINE
  while # is
  0023 ; [*0391.0020.0002] # NUMBER SIGN
  Note that the collation numbers for _ are lower than the numbers for #.
Source: https://stackoverflow.com/questions/32705178/why-doesnt-icu4j-match-utf-8-sort-order