Unicode-correct title case in Java

被撕碎了的回忆 2021-01-01 09:41

I've been looking through the bazillion StackOverflow questions about capitalizing a word in Java, and none of them seem to care in the least about internationalizat

3 Answers
  • 2021-01-01 10:27

    The only two-character digraph in which both characters are capitalized together, and that you are likely to encounter in a real-life program, is the Dutch IJ. Just handle it when the locale is Dutch. In the unlikely worst case, there will be one or two more cases to add later; you will not run into a new capitalization digraph every day, so generalizing here is not worth the effort.

    Note that, in general, character-by-character conversion cannot produce title or upper case for an arbitrary language: some lower-case characters map to more than one upper-case character. So in the generic case you have to work with String.
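
    A minimal sketch of the one-to-many problem, using the German sharp s as the example (the class and method names are hypothetical, chosen only for illustration). The char-based API can return at most one char, while the String-based API can expand the text:

```java
import java.util.Locale;

final class OneToManyDemo {

    /** Char-based upper-casing: one char in, exactly one char out. */
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    /** String-based upper-casing: the result may be longer than the input. */
    static String upperString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF'; // ß, LATIN SMALL LETTER SHARP S

        // The char API has no single-char upper case for ß and returns it unchanged.
        System.out.println(upperChar(sharpS) == sharpS);

        // The String API applies the full mapping and yields two characters.
        System.out.println(upperString("\u00DF"));
    }
}
```

    Running main should print true and then SS, which is why the rest of this answer upper-cases through String before touching individual characters.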

    But title case itself poses no locale problem. There is probably a small misunderstanding about how the toTitleCase() method works: it converts any character to title case, including one that is already upper case.

    For example, consider the dž character (U+01C6). Its upper-case form is DŽ (U+01C4) and its title-case form is Dž (U+01C5):

    System.out.println(Character.toUpperCase('\u01C6'));
    DŽ

    and

    System.out.println(Character.toTitleCase('\u01C6'));
    Dž

    However, applying toTitleCase() to the upper-case form also gives the title-case form:

    System.out.println(Character.toTitleCase(Character.toUpperCase('\u01C6')));
    Dž

    So if you first convert to upper case with the locale and then to title case, you get the correct code point, and applying title case to the result is safe, including for Turkish and similar locales:

    System.out.println(Character.toTitleCase("dž".toUpperCase().charAt(0)));
    System.out.println(Character.toTitleCase("i".toUpperCase(Locale.forLanguageTag("tr")).charAt(0)));
    Dž
    İ
    

    Note that simply taking the title-case form of a single character whenever it differs from its upper-case form is not correct in the general case.

    To summarize:

    • Handle the Dutch digraph (and any other digraphs you run into, which I highly doubt; at worst it will be one or two cases over the program's lifetime).
    • Convert the required characters as a String using toUpperCase() with the locale.
    • Convert every character of the toUpperCase() result using Character.toTitleCase().
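
    The three steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names are hypothetical, the "required characters" are taken to be the first code point (or the two letters of a leading Dutch "ij"), and the rest of the word is left untouched.

```java
import java.util.Locale;

final class FirstLetterTitleCase {

    /** Hypothetical helper: capitalizes only the start of {@code word}. */
    static String capitalize(String word, Locale locale) {
        if (word.isEmpty()) {
            return word;
        }
        // Step 1: treat a leading Dutch "ij" digraph as a single unit.
        boolean dutchIj = "nl".equals(locale.getLanguage())
                && word.regionMatches(true, 0, "ij", 0, 2);
        // Otherwise the head is the first code point (which may span two chars).
        int headEnd = dutchIj ? 2 : word.offsetByCodePoints(0, 1);

        // Step 2: locale-sensitive upper case on the head, done as a String.
        String upper = word.substring(0, headEnd).toUpperCase(locale);

        // Step 3: map every code point of the upper-cased head to title case.
        StringBuilder out = new StringBuilder(word.length());
        upper.codePoints()
             .forEach(cp -> out.appendCodePoint(Character.toTitleCase(cp)));

        // Assumption of this sketch: the remainder is appended unchanged.
        return out.append(word, headEnd, word.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("ijsselmeer", Locale.forLanguageTag("nl")));
        System.out.println(capitalize("istanbul", Locale.forLanguageTag("tr")));
        System.out.println(capitalize("\u01C6ungla", Locale.ROOT));
    }
}
```

    With these inputs the sketch should yield IJsselmeer, İstanbul (dotted capital I, U+0130) and a word starting with the title-case digraph Dž (U+01C5).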

    Note that some capitalization cases are still context-aware, such as Irish prefixes and English names beginning with ff. They require more than character or string processing, but I doubt you need to handle them to generate titles in a program.

  • 2021-01-01 10:36

    The problem is that the distinction between upper- and lower-case letters is very language-specific. Many languages, maybe most, do not have one at all.

    Anyway, there is a Unicode FAQ: http://www.unicode.org/faq/casemap_charprop.html

    ...and I guess there is a Unicode mapping table somewhere (something like ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt). So it's probably best to write your own conversion method.
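
    If you do go the route of reading the mapping table yourself, a sketch could look like the following. It assumes the usual UnicodeData.txt layout of semicolon-separated fields, with the simple upper-, lower- and title-case mappings in fields 12, 13 and 14 (counting from 0). The class name is hypothetical, and the sample line is typed in by hand for illustration, so check it against the real file. Note that this table only holds one-to-one mappings; multi-character and locale-specific ones live in a separate file (SpecialCasing.txt).

```java
final class UnicodeDataLine {

    // Field indexes of the simple case mappings (0-based).
    static final int UPPER = 12;
    static final int LOWER = 13;
    static final int TITLE = 14;

    /** Returns the code point stored in the given field, or -1 if it is empty. */
    static int mapping(String line, int field) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = line.split(";", -1);
        String hex = fields[field];
        return hex.isEmpty() ? -1 : Integer.parseInt(hex, 16);
    }

    public static void main(String[] args) {
        // Illustrative entry for U+01C6 (dž); verify against the actual file.
        String line = "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;"
                + "<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5";

        System.out.printf("upper=U+%04X title=U+%04X lower=%d%n",
                mapping(line, UPPER), mapping(line, TITLE), mapping(line, LOWER));
    }
}
```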

  • 2021-01-01 10:44

    Like you, I was unable to find a suitable method in the core Java API.

    However, there does seem to be a locale-sensitive string-title-case method (UCharacter#toTitleCase) in the ICU library.


    Looking at the source for the relevant ICU methods (UCharacter#toTitleCase and UCaseProps#toUpperOrTitle), there don't seem to be many locale-specific special cases for title-casing, so you might be able to get away with the following:

    1. Find the first cased character in the string.
    2. If it has a title-case form distinct from its upper-case form, use that.
    3. Otherwise, perform a locale-sensitive upper-case on that first character and its combining characters.
    4. Perform a locale-sensitive lower-case on the rest of the string.
    5. If the locale is Dutch and the first cased character is an "I" followed by a "j", upper-case the "j".
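
    The five steps above can be sketched with the core Java API alone. This is a rough illustration, not what ICU does internally: the class and method names are hypothetical, "cased" is approximated as upper-, lower- or title-case per Character, and "combining characters" as the mark categories Mn, Mc and Me.

```java
import java.util.Locale;

final class SimpleTitleCase {

    /** Approximation of "cased": any upper-, lower- or title-case code point. */
    static boolean isCased(int cp) {
        return Character.isUpperCase(cp) || Character.isLowerCase(cp)
                || Character.isTitleCase(cp);
    }

    /** True for combining marks (general categories Mn, Mc, Me). */
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    static String titleCase(String s, Locale locale) {
        // Step 1: find the first cased character.
        int start = 0;
        while (start < s.length() && !isCased(s.codePointAt(start))) {
            start += Character.charCount(s.codePointAt(start));
        }
        if (start == s.length()) {
            return s; // nothing to capitalize
        }
        int first = s.codePointAt(start);
        int baseEnd = start + Character.charCount(first);
        int end = baseEnd; // end of the first character plus its marks
        while (end < s.length() && isMark(s.codePointAt(end))) {
            end += Character.charCount(s.codePointAt(end));
        }

        String head;
        int title = Character.toTitleCase(first);
        if (title != Character.toUpperCase(first)) {
            // Step 2: a distinct title-case form exists, so use it.
            head = new StringBuilder().appendCodePoint(title)
                    .append(s, baseEnd, end).toString();
        } else {
            // Step 3: locale-sensitive upper case on the character and its marks.
            head = s.substring(start, end).toUpperCase(locale);
        }

        // Step 4: locale-sensitive lower case on the rest.
        String tail = s.substring(end).toLowerCase(locale);

        // Step 5: Dutch "IJ" keeps both letters capitalized.
        if ("nl".equals(locale.getLanguage())
                && (first == 'i' || first == 'I') && tail.startsWith("j")) {
            tail = "J" + tail.substring(1);
        }
        return s.substring(0, start) + head + tail;
    }

    public static void main(String[] args) {
        System.out.println(titleCase("IJSSELMEER", Locale.forLanguageTag("nl")));
        System.out.println(titleCase("istanbul", Locale.forLanguageTag("tr")));
    }
}
```

    Lower-casing the tail as a separate substring loses context that some mappings depend on, so treat this as a starting point and prefer the ICU method where the library is available.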