Unicode characters having asymmetric upper/lower case. Why?

眼角桃花 2021-01-04 14:56

Why do the following three characters not have symmetric toLower and toUpper results?

/**
  * Written in the Scala programming language         


        
2 answers
  •  情话喂你
    2021-01-04 15:34

    For the first one, there is this explanation:

    In the German language, the Sharp S ("ß" or U+00DF) is a lowercase letter, and it capitalizes to the two letters "SS".

    In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
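    A minimal sketch of this case, assuming a JVM whose Unicode tables include U+1E9E (Java 7 or later). The object name `SharpS` and its fields are made up for illustration. It contrasts the per-character mapping, which has to return a single Char, with the string-level mapping, which is allowed to grow:

```scala
import java.util.Locale

// Hypothetical helper object, not part of the original question's code.
object SharpS {
  val capitalSharpS: Char = '\u1E9E' // LATIN CAPITAL LETTER SHARP S
  val smallSharpS: Char   = '\u00DF' // LATIN SMALL LETTER SHARP S

  // Per-character mappings: one Char in, one Char out.
  val loweredCapital: Char = Character.toLowerCase(capitalSharpS)
  val upperedSmall: Char   = Character.toUpperCase(smallSharpS)

  // String-level mapping: the result may be longer than the input.
  val upperedString: String = smallSharpS.toString.toUpperCase(Locale.ROOT)

  def hex(c: Char): String = "U+%04X".format(c.toInt)

  def main(args: Array[String]): Unit = {
    println(hex(loweredCapital)) // U+00DF
    println(hex(upperedSmall))   // U+00DF (no single-character uppercase exists)
    println(upperedString)       // SS
  }
}
```

    So the capital form lower-cases to U+00DF, but nothing upper-cases back to U+1E9E.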

    For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.
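    The Kelvin round trip can be checked the same way. This is only a sketch; `KelvinSign` and its field names are hypothetical:

```scala
// Hypothetical helper object, not part of the original question's code.
object KelvinSign {
  val kelvin: Char = '\u212A' // KELVIN SIGN

  // Lower-casing lands on the ordinary Latin small k.
  val lowered: Char = Character.toLowerCase(kelvin)

  // Upper-casing that result gives the ordinary Latin capital K, not the Kelvin sign.
  val roundTrip: Char = Character.toUpperCase(lowered)

  def hex(c: Char): String = "U+%04X".format(c.toInt)

  def main(args: Array[String]): Unit = {
    println(hex(lowered))        // U+006B
    println(hex(roundTrip))      // U+004B
    println(roundTrip == kelvin) // false
  }
}
```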

    For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.
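    A sketch of the locale dependence, assuming the JVM's built-in Turkish casing rules, selected here with the language tag "tr". The object `DottedI` and its fields are illustrative names only:

```scala
import java.util.Locale

// Hypothetical helper object, not part of the original question's code.
object DottedI {
  val turkish: Locale = Locale.forLanguageTag("tr")

  // Per-character lower-casing is locale-independent and yields a plain i.
  val lowered: Char = Character.toLowerCase('\u0130')

  // String upper-casing takes a locale, and the answer differs by locale.
  val upperRoot: String    = "i".toUpperCase(Locale.ROOT)
  val upperTurkish: String = "i".toUpperCase(turkish)

  def main(args: Array[String]): Unit = {
    println(lowered == 'i')           // true
    println(upperRoot == "I")         // true
    println(upperTurkish == "\u0130") // true
  }
}
```

    Only the locale-aware String methods can restore U+0130. Character.toUpperCase takes no locale, so it always maps i to I.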

    Characters need not necessarily have symmetric upper- and lower-case transformations.

    Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

    Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.

    In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
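    To illustrate the normalization advice, here is a small sketch using java.text.Normalizer from the JDK. The object name and the `nfc` helper are made up for this example:

```scala
import java.text.Normalizer

// Hypothetical helper object, not part of the original question's code.
object LetterlikeNormalization {
  // Normalize to NFC (canonical decomposition followed by canonical composition).
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  val kelvin: String   = nfc("\u212A") // KELVIN SIGN
  val ohm: String      = nfc("\u2126") // OHM SIGN
  val angstrom: String = nfc("\u212B") // ANGSTROM SIGN

  def main(args: Array[String]): Unit = {
    println(kelvin == "K")        // true: LATIN CAPITAL LETTER K
    println(ohm == "\u03A9")      // true: GREEK CAPITAL LETTER OMEGA
    println(angstrom == "\u00C5") // true: LATIN CAPITAL LETTER A WITH RING ABOVE
  }
}
```

    Once the text is normalized, the Kelvin case round-trips symmetrically, because only the regular letters K and k remain.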
