问题
I was browsing through the openjdk and noticed a weird code path in String.equalsIgnoreCase, specifically the method regionMatches:
if (ignoreCase) {
// If characters don't match but case may be ignored,
// try converting both characters to uppercase.
// If the results match, then the comparison scan should
// continue.
char u1 = Character.toUpperCase(c1);
char u2 = Character.toUpperCase(c2);
if (u1 == u2) {
continue;
}
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
continue;
}
}
I understand the comment about adjusting for a specific alphabet to check the lower case equality, but was wondering why even have the upper case check? Why not just do all lower case?
回答1:
Now that the question is re-opened, I transfer my answer here.
The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.
Comparing only uppercase is not enough, e.g. the ASCII letter "I" and the capital I with dot "İ" ((char)304
, used in Turkish alphabet) have different uppercase (they are already uppercase), but they have the same lowercase letter "i". (Note that the Turkish language considers i with dot and i without dot as different letters, not just an accented letter, similar to German with its Umlauts ä/ö/ü vs. a/o/u.)
Comparing only lowercase is not enough, e.g. the ASCII letter "i" and the small dotless i "ı" ((char)305
). They have different lowercase (they are already lowercase), but they have the same uppercase letter "I".
And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).
So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.
来源:https://stackoverflow.com/questions/25513019/string-equalsignorecase-uppercase-v-lowercase