Is there a way to determine a string is English or Arabic?
English characters tend to be in these 4 Unicode blocks:
GENERAL_PUNCTUATION
public static boolean isEnglish(String text) {
boolean onlyEnglish = false;
for (char character : text.toCharArray()) {
if (Character.UnicodeBlock.of(character) == Character.UnicodeBlock.BASIC_LATIN
|| Character.UnicodeBlock.of(character) == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
|| Character.UnicodeBlock.of(character) == Character.UnicodeBlock.LATIN_EXTENDED_A
|| Character.UnicodeBlock.of(character) == Character.UnicodeBlock.GENERAL_PUNCTUATION) {
onlyEnglish = true;
} else {
onlyEnglish = false;
}
}
return onlyEnglish;
}
Java in itself supports various language checks by unicode, Arabic is also supported. Much simpler and smallest way to do the same is by UnicodeBlock
public static boolean textContainsArabic(String text) {
for (char charac : text.toCharArray()) {
if (Character.UnicodeBlock.of(charac) == Character.UnicodeBlock.ARABIC) {
return true;
}
}
return false;
}
Try This :
internal static bool ContainsArabicLetters(string text)
{
foreach (char character in text.ToCharArray())
{
if (character >= 0x600 && character <= 0x6ff)
return true;
if (character >= 0x750 && character <= 0x77f)
return true;
if (character >= 0xfb50 && character <= 0xfc3f)
return true;
if (character >= 0xfe70 && character <= 0xfefc)
return true;
}
return false;
}
You could use N-gram-based text categorization (google for that phrase) but it is not a fail-proof technique, and it may require a not too short string.
You might also decide that a string with only ASCII letters is not Arabic.
This answer is somewhat correct. But when we combine Farsi and English letters it returns TRUE!, which is not true. Here I modified the same method so that it works well
public static boolean isProbablyArabic(String s) {
for (int i = 0; i < s.length();) {
int c = s.codePointAt(i);
if (!(c >= 0x0600 && c <= 0x06E0))
return false;
i += Character.charCount(c);
}
return true;
}
Here is a simple logic that I just tried:
public static boolean isProbablyArabic(String s) {
for (int i = 0; i < s.length();) {
int c = s.codePointAt(i);
if (c >= 0x0600 && c <= 0x06E0)
return true;
i += Character.charCount(c);
}
return false;
}
It declares the text as arabic if and only if an arabic unicode code point is found in the text. You can enhance this logic to be more suitable for your needs.
The range 0600 - 06E0 is the code point range of Arabic characters and symbols (See Unicode tables)