I am having a hard time understanding Unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#")
under ICU4J 55.1 I get a
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
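To make the "binary collation" case concrete, here is a minimal Java sketch of a comparison by numeric code point only. The class name CodePointOrder and the method compareBinary are made up for illustration; they are not part of ICU4J or the JDK. It only shows that # (U+0023) sorts before _ (U+005F) when nothing but code point values are considered, and that UTF-8 merely decides how those code points are stored as bytes.

```java
import java.nio.charset.StandardCharsets;

class CodePointOrder {
    // Hypothetical helper: compares two strings by code point value only,
    // which is what a binary collation amounts to. No collation rules involved.
    static int compareBinary(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other (or they are equal):
        // the one with code points left over sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        // Code points: '_' is U+005F, '#' is U+0023.
        System.out.printf("_ is U+%04X%n", "_".codePointAt(0));
        System.out.printf("# is U+%04X%n", "#".codePointAt(0));

        // UTF-8 is only the storage format; both characters take one byte here.
        System.out.println("_".getBytes(StandardCharsets.UTF_8).length);

        // Positive result: in code point order, "_" sorts after "#".
        System.out.println(compareBinary("_", "#"));
    }
}
```

A positive result here is the opposite of what a UCA-based collator reports for the same pair, which is the discrepancy the question is about.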
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
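As a rough illustration of how to read those two entries, the sketch below pulls the first (primary) weight out of each line and compares them. The class PrimaryWeightDemo and the method primaryWeight are hypothetical names, and the parsing is a deliberate simplification: it handles only a single collation element per line, ignores the secondary and tertiary weights, and assumes the primary weights are actually used in the comparison (that is, variable characters are not being ignored by a collator setting). It is not how ICU4J implements collation.

```java
class PrimaryWeightDemo {
    // Hypothetical helper: extracts the primary weight, i.e. the first hex
    // field inside the brackets, from a line such as
    // "005F ; [*010A.0020.0002] # LOW LINE".
    static int primaryWeight(String line) {
        int open = line.indexOf('[');
        // Skip '[' and the one-character marker ('*' or '.') before the weight.
        int start = open + 2;
        int dot = line.indexOf('.', start);
        return Integer.parseInt(line.substring(start, dot), 16);
    }

    public static void main(String[] args) {
        // The two entries quoted above from allkeys_CLDR.txt.
        String lowLine = "005F ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023 ; [*0290.0020.0002] # NUMBER SIGN";

        int underscoreWeight = primaryWeight(lowLine);   // 0x010A
        int hashWeight = primaryWeight(numberSign);      // 0x0290

        // Negative result: by primary weight, "_" sorts before "#".
        System.out.println(Integer.compare(underscoreWeight, hashWeight));
    }
}
```

Because 010A is smaller than 0290, the underscore sorts first at the primary level, so the later weights never need to be consulted for this pair.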
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific case, for other cases you would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see whether there are any locale-specific overrides.
Converting Mark Ransom's comments into an answer:
_ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.