Java: how to check if a character belongs to a specific Unicode block?

故里飘歌 2021-01-04 02:43

I need to identify what natural language my input belongs to. The goal is to distinguish between Arabic and English words in a mixed input, where the inpu

5 Answers
  • 2021-01-04 03:16

    Yes, you can simply use Character.UnicodeBlock.of(char)
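
    As a minimal sketch of that call, the helper below (the names BlockCheck and isArabicBlock are made up for illustration) tests whether a single char falls in the main Arabic block. It assumes the input stays within the Basic Multilingual Plane; Arabic text can also use other blocks such as ARABIC_SUPPLEMENT or the presentation-forms blocks, which this sketch does not cover.

```java
class BlockCheck {
    // Hypothetical helper: true if c lies in the main Arabic block (U+0600..U+06FF).
    static boolean isArabicBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isArabicBlock('\u0628')); // ARABIC LETTER BEH, prints true
        System.out.println(isArabicBlock('b'));      // prints false
    }
}
```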

  • 2021-01-04 03:25

    You have the opposite problem to this one, but ironically, what doesn't work for that asker should work well for you: just look for English words (ASCII-only characters) with the regular expression "\w". Note that in Java, \w by default matches only [a-zA-Z_0-9], so it also accepts digits and underscores.
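
    A rough sketch of that idea, assuming the default regex flags (so \w stays ASCII-only) and using hypothetical names AsciiWords and asciiWords:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiWords {
    // Without the UNICODE_CHARACTER_CLASS flag, \w matches only [a-zA-Z_0-9].
    private static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Hypothetical helper: collects every run of ASCII word characters in the input.
    static List<String> asciiWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = ASCII_WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // The middle word is Arabic, written with Unicode escapes.
        System.out.println(asciiWords("hello \u0645\u0631\u062D\u0628\u0627 world"));
    }
}
```

    Everything the pattern skips (here, the Arabic word) is left for a separate check.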

  • 2021-01-04 03:26

    The Unicode Script property is probably more useful. In Java, it can be looked up using the java.lang.Character.UnicodeScript enum:

    Character.UnicodeScript script = Character.UnicodeScript.of(c);
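
    To apply this to whole words, one possible sketch is to classify each word by the script of its first letter. The names ScriptCheck and scriptOf are hypothetical, and returning COMMON for a word with no letters is an arbitrary choice made for this example.

```java
class ScriptCheck {
    // Hypothetical helper: returns the script of the first letter in the word,
    // or COMMON if the word contains no letters (an assumption of this sketch).
    static Character.UnicodeScript scriptOf(String word) {
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i); // handles surrogate pairs
            if (Character.isLetter(cp)) {
                return Character.UnicodeScript.of(cp);
            }
            i += Character.charCount(cp);
        }
        return Character.UnicodeScript.COMMON;
    }

    public static void main(String[] args) {
        for (String word : "hello \u0645\u0631\u062D\u0628\u0627 123".split("\\s+")) {
            System.out.println(word + " -> " + scriptOf(word));
        }
    }
}
```

    A word would then count as Arabic when scriptOf returns Character.UnicodeScript.ARABIC, and as English-looking when it returns LATIN.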
    
  • 2021-01-04 03:31

    If [A-Za-z]+ meets your requirements, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin-1 range (U+0000 to U+00FF, including accented letters and ligatures), you can use this:

    Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");
    

    That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
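
    As a small usage sketch of that pattern (the names Latin1Letters and isLatin1Word are hypothetical), matches() requires the whole string to consist of Latin-1 letters:

```java
import java.util.regex.Pattern;

class Latin1Letters {
    // Letters (\pL) intersected with the Latin-1 range (\p{L1}, U+0000..U+00FF).
    private static final Pattern LATIN1_WORD = Pattern.compile("[\\pL&&\\p{L1}]+");

    // Hypothetical helper: true if s is non-empty and made up only of Latin-1 letters.
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Word("caf\u00E9"));     // accented Latin-1 letter
        System.out.println(isLatin1Word("\u0645\u0631"));  // Arabic letters
        System.out.println(isLatin1Word("abc1"));          // contains a digit
    }
}
```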

  • 2021-01-04 03:31

    English characters tend to be in these 4 Unicode blocks:

    ArrayList<Character.UnicodeBlock> english = new ArrayList<>();
    english.add(Character.UnicodeBlock.BASIC_LATIN);
    english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
    english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
    english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);
    

    So if you have a String, you can loop over all the characters and see what Unicode block each character is in:

    for (char currentChar : myString.toCharArray())  
    {
        Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
        if (english.contains(unicodeBlock))
        {
            // This character is English
        }
    }
    

    If every character passes the check, then you know the string consists entirely of characters used in English. You could repeat this for any language; you'll just have to figure out what Unicode blocks each language uses.
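
    For instance, a sketch of the same check for Arabic might look like the following. The block list is an assumption chosen for illustration and is not exhaustive, and the names BlockFilter, ARABIC_BLOCKS, and allInBlocks are hypothetical. It iterates over code points rather than chars so that characters outside the Basic Multilingual Plane are handled too.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BlockFilter {
    // Assumed set of blocks for Arabic text; adjust to your own data.
    static final Set<Character.UnicodeBlock> ARABIC_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.ARABIC,
            Character.UnicodeBlock.ARABIC_SUPPLEMENT,
            Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A,
            Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B));

    // Hypothetical helper: true if s is non-empty and every code point is in one of the blocks.
    static boolean allInBlocks(String s, Set<Character.UnicodeBlock> blocks) {
        return !s.isEmpty()
                && s.codePoints().allMatch(cp -> blocks.contains(Character.UnicodeBlock.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(allInBlocks("\u0645\u0631\u062D\u0628\u0627", ARABIC_BLOCKS));
        System.out.println(allInBlocks("hello", ARABIC_BLOCKS));
    }
}
```

    Spaces and punctuation live in other blocks, so in practice you would split the input into words first or add those blocks to the set.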

    Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or other languages whose characters have a lot of overlap with English.

    There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you:

    https://code.google.com/p/language-detection/
