I need to identify what natural language my input belongs to. The goal is to distinguish between Arabic and English words in a mixed input, where the inpu
English characters tend to be in these 4 Unicode blocks:
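import java.util.ArrayList;
import java.util.List;

// Unicode blocks that English text typically draws its characters from
List<Character.UnicodeBlock> english = new ArrayList<>();
english.add(Character.UnicodeBlock.BASIC_LATIN);
english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);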
So if you have a String, you can loop over all the characters and see what Unicode block each character is in:
for (char currentChar : myString.toCharArray())
{
    Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
    if (english.contains(unicodeBlock))
    {
        // This character is English
    }
}
If every character falls in one of those blocks, then you know the string contains only characters used in English. You could repeat this for any language; you'll just have to figure out what Unicode blocks each language uses.
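For the Arabic/English case in the question, the same idea could be applied word by word. The following is only a rough sketch, not a definitive implementation: the class name ScriptClassifier, the classify method, the "ENGLISH"/"ARABIC"/"UNKNOWN" labels, and the choice of Arabic blocks are all my own assumptions. It counts letters only, so digits and punctuation do not influence the result.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: labels a word by the Unicode blocks of its letters.
class ScriptClassifier {
    private static final Set<Character.UnicodeBlock> ENGLISH = new HashSet<>();
    private static final Set<Character.UnicodeBlock> ARABIC = new HashSet<>();

    static {
        // The Latin blocks from the list above (punctuation is skipped below anyway).
        ENGLISH.add(Character.UnicodeBlock.BASIC_LATIN);
        ENGLISH.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
        ENGLISH.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
        // Assumption: these blocks are enough to cover typical Arabic text.
        ARABIC.add(Character.UnicodeBlock.ARABIC);
        ARABIC.add(Character.UnicodeBlock.ARABIC_SUPPLEMENT);
        ARABIC.add(Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A);
        ARABIC.add(Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B);
    }

    // Returns "ARABIC", "ENGLISH", or "UNKNOWN" (mixed scripts, other scripts, or no letters).
    static String classify(String word) {
        int english = 0;
        int arabic = 0;
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!Character.isLetter(codePoint)) {
                continue; // ignore digits, punctuation, and whitespace
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
            if (ARABIC.contains(block)) {
                arabic++;
            } else if (ENGLISH.contains(block)) {
                english++;
            }
        }
        if (arabic > 0 && english == 0) {
            return "ARABIC";
        }
        if (english > 0 && arabic == 0) {
            return "ENGLISH";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        // The second word is the Arabic greeting "marhaban", written with Unicode escapes.
        String input = "Hello \u0645\u0631\u062D\u0628\u0627 world 123";
        for (String word : input.split("\\s+")) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

Splitting on whitespace is the simplest possible tokenizer; a word that mixes both scripts comes back as "UNKNOWN" rather than being forced into one bucket.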
Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or other languages whose characters have a lot of overlap with English.
There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you:
https://code.google.com/p/language-detection/