String.equalsIgnoreCase - UpperCase v. LowerCase

问题

I was browsing through the openjdk and noticed a weird code path in String.equalsIgnoreCase, specifically the method regionMatches:

if (ignoreCase) {
    // If characters don't match but case may be ignored,
    // try converting both characters to uppercase.
    // If the results match, then the comparison scan should
    // continue.
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
        continue;
    }
    // Unfortunately, conversion to uppercase does not work properly
    // for the Georgian alphabet, which has strange rules about case
    // conversion.  So we need to make one last check before
    // exiting.
    if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
        continue;
    }
}

I understand the comment about adjusting for a specific alphabet to check the lower case equality, but was wondering why even have the upper case check? Why not just do all lower case?

回答1:

Now that the question is re-opened, I transfer my answer here.

The short answer to "Why do they not just compare only lowercase instead of both upper and lower case, if it matches more cases than uppercase?": It does not match more character pairs, it merely matches different pairs.

Comparing only uppercase is not enough, e.g. the ASCII letter "I" and the capital I with dot "İ" ((char)304, used in Turkish alphabet) have different uppercase (they are already uppercase), but they have the same lowercase letter "i". (Note that the Turkish language considers i with dot and i without dot as different letters, not just an accented letter, similar to German with its Umlauts ä/ö/ü vs. a/o/u.)

Comparing only lowercase is not enough, e.g. the ASCII letter "i" and the small dotless i "ı" ((char)305). They have different lowercase (they are already lowercase), but they have the same uppercase letter "I".

And finally, compare capital I with dot "İ" with small dotless i "ı". Neither their uppercases ("İ" vs. "I") nor their lowercases ("i" vs. "ı") match, but the lowercase of their uppercase is the same ("I"). I found another case if this phenomenon, in the greek letters "ϴ" and "ϑ" (char 1012 and 977).

So a true case insensitive comparison can not even check uppercases and lowercases of the original characters, but must check the lowercases of the uppercases.

来源：https://stackoverflow.com/questions/25513019/string-equalsignorecase-uppercase-v-lowercase

标签

java

string

equals

OpenJDK