Question
In Java, the String#toLowerCase method uses the default system Locale to determine how to handle lowercasing. If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?
EDIT: I'm mainly concerned about programming identifiers such as table and column names in a schema, so I want English lowercasing rules to apply.
The documentation for Locale.ROOT states that it is the language/country neutral locale for locale-sensitive operations. Locale.ENGLISH would presumably also be a safe choice.
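The dependence on the default locale described above can be reproduced with a minimal sketch. The class `IdentifierCase` and its `normalize` helper are hypothetical names introduced only for illustration, and the example assumes the JVM is temporarily switched to a Turkish default locale to stand in for a user's machine.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original question.
final class IdentifierCase {
    // Lowercase with an explicit, language-neutral locale so the result
    // does not depend on the JVM's default locale.
    static String normalize(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        try {
            // Assumption: simulate a machine configured for Turkish.
            Locale.setDefault(Locale.forLanguageTag("tr-TR"));

            // The no-argument overload follows the default locale,
            // so 'I' becomes the dotless i (U+0131).
            System.out.println("USER_ID".toLowerCase().equals("user_\u0131d")); // true

            // An explicit Locale.ROOT keeps the plain ASCII result.
            System.out.println(normalize("USER_ID").equals("user_id")); // true
        } finally {
            // Restore the original default so other code is unaffected.
            Locale.setDefault(saved);
        }
    }
}
```

Passing the locale explicitly at every call site is the design choice here: it keeps schema identifiers stable no matter where the program runs.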
Answer 1:
Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts, since it doesn't involve any special casing rules and all 7-bit ASCII characters in the ENGLISH locale case-convert to 7-bit ASCII characters.
That is not true of every locale. In Turkish, for example, the 'I' and 'i' characters are not case-converted to one another.
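The claim that ASCII stays ASCII under ENGLISH but not under Turkish can be checked directly. The following is a rough sketch, with `AsciiCaseCheck` and its methods being hypothetical names chosen for this example; it walks all 128 ASCII code points and tests both case conversions.

```java
import java.util.Locale;

// Hypothetical checker for illustration only.
final class AsciiCaseCheck {
    // True if every character of s is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(ch -> ch < 128);
    }

    // True if lowercasing and uppercasing any single ASCII character
    // under the given locale yields only ASCII characters.
    static boolean asciiStaysAscii(Locale locale) {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (!isAscii(s.toLowerCase(locale)) || !isAscii(s.toUpperCase(locale))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(asciiStaysAscii(Locale.ENGLISH));               // true
        System.out.println(asciiStaysAscii(Locale.ROOT));                  // true
        System.out.println(asciiStaysAscii(Locale.forLanguageTag("tr"))); // false
    }
}
```

Under the Turkish locale the check fails because 'I' lowercases to U+0131 and 'i' uppercases to U+0130, both outside ASCII.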
"Dotted and dotless I" explains:
The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.
In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.
Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
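The round-trip behavior described in the quoted passage can be observed in Java with a short sketch. `CaseRoundTrip` and `upperThenLower` are hypothetical names used only here, and the dotless i is written as a Unicode escape so the source file does not depend on any particular encoding.

```java
import java.util.Locale;

// Hypothetical demo class for illustration only.
final class CaseRoundTrip {
    // Uppercase and then lowercase a string under a single locale.
    static String upperThenLower(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        String dotlessI = "\u0131"; // LATIN SMALL LETTER DOTLESS I
        Locale turkish = Locale.forLanguageTag("tr");

        // Without Turkish rules the round trip goes dotless i -> I -> i,
        // so the letter changes.
        System.out.println(upperThenLower(dotlessI, Locale.ROOT).equals("i")); // true

        // With Turkish rules, I lowercases back to dotless i,
        // so the round trip preserves the letter.
        System.out.println(upperThenLower(dotlessI, turkish).equals(dotlessI)); // true
    }
}
```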
The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt
# ================================================================================
# Turkish and Azeri
# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
...
Answer 2:
If I am lowercasing some ASCII text and want to be sure that this is processed as expected which Locale should I use?
That depends on what "as expected" means for you. The point of letting you specify a Locale is that uppercasing and lowercasing do not work the same way in all languages, even though they may use the same letters. So specify the Locale that you and/or your customers live in, and it will probably work as you or they expect.
Source: https://stackoverflow.com/questions/10336730/which-locale-should-i-specify-when-i-call-stringtolowercase