How to determine whether a string is English or Arabic?


我寻月下人不归

Is there a way to determine whether a string is English or Arabic?


8 answers

轮回少年

2021-01-31 18:00

English characters tend to be in these 4 Unicode blocks:

BASIC_LATIN
LATIN_1_SUPPLEMENT
LATIN_EXTENDED_A
GENERAL_PUNCTUATION

public static boolean isEnglish(String text) {
    // An empty string contains no English characters.
    if (text.isEmpty()) {
        return false;
    }
    for (char character : text.toCharArray()) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(character);
        if (block != Character.UnicodeBlock.BASIC_LATIN
                && block != Character.UnicodeBlock.LATIN_1_SUPPLEMENT
                && block != Character.UnicodeBlock.LATIN_EXTENDED_A
                && block != Character.UnicodeBlock.GENERAL_PUNCTUATION) {
            // A single character outside the four blocks is enough to reject.
            // (The earlier version overwrote the flag on every iteration,
            // so only the last character decided the result.)
            return false;
        }
    }
    return true;
}


时光取名叫无心

2021-01-31 18:04

Java itself supports checking characters against Unicode blocks, and Arabic is among them. The simplest and shortest way to do this is with Character.UnicodeBlock:

public static boolean textContainsArabic(String text) {
    for (char charac : text.toCharArray()) {
        if (Character.UnicodeBlock.of(charac) == Character.UnicodeBlock.ARABIC) {
            return true;
        }
    }
    return false;
}
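
A possible variation on the block check above, as a minimal sketch: Character.UnicodeBlock.ARABIC covers only U+0600 to U+06FF, so presentation forms and the Arabic Supplement block are missed. Character.UnicodeScript (Java 7+) groups code points by script instead. The class and method names here are hypothetical, and note that some characters in the Arabic block, such as the Arabic comma, are classified as COMMON rather than ARABIC.

```java
final class ArabicScriptCheck {
    // Hypothetical helper, not part of the answer above: returns true if any
    // code point in the text belongs to the Arabic script.
    static boolean containsArabicScript(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }
}
```

Iterating over code points rather than chars also keeps surrogate pairs intact, which the char-based loop does not.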


猫巷女王i

2021-01-31 18:05

Try this:

internal static bool ContainsArabicLetters(string text)
{
    foreach (char character in text.ToCharArray())
    {
        // Arabic block
        if (character >= 0x600 && character <= 0x6ff)
            return true;
        // Arabic Supplement block
        if (character >= 0x750 && character <= 0x77f)
            return true;
        // First part of Arabic Presentation Forms-A
        if (character >= 0xfb50 && character <= 0xfc3f)
            return true;
        // Arabic Presentation Forms-B
        if (character >= 0xfe70 && character <= 0xfefc)
            return true;
    }
    return false;
}


感情败类

2021-01-31 18:10

You could use N-gram-based text categorization (google for that phrase), but it is not a fail-proof technique, and it may need a reasonably long string to work well.

You might also decide that a string with only ASCII letters is not Arabic.
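
The ASCII shortcut can be sketched in a few lines of Java. This is only an illustration; the class and method names are hypothetical. It treats any string whose chars are all below U+0080 as "not Arabic" and says nothing about strings that contain other characters.

```java
final class AsciiHeuristic {
    // Hypothetical helper: true when every char is in the ASCII range
    // (U+0000 to U+007F). An empty string also returns true.
    static boolean isAsciiOnly(String text) {
        return text.chars().allMatch(c -> c < 0x80);
    }
}
```

A caller could run this first and fall back to a more expensive check only when it returns false.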


孤街浪徒

2021-01-31 18:10

The isProbablyArabic answer is only partly correct: when Farsi and English letters are combined it still returns true, which is wrong. Here is the same method, modified so that it returns true only when every code point falls in the range:

 public static boolean isProbablyArabic(String s) {
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        if (!(c >= 0x0600 && c <= 0x06E0))
            return false;
        i += Character.charCount(c);
    }
    return true;
}
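
For mixed strings, a middle ground between "any character matches" and "every character matches" is a majority vote over letters. The following is a rough sketch under that assumption, not a method from any of the answers; the class, enum, and method names are hypothetical. It ignores digits, spaces, and punctuation, and it cannot tell Arabic from Farsi, since both use the Arabic script.

```java
final class ScriptMajority {
    enum Guess { ARABIC, LATIN, UNDETERMINED }

    // Hypothetical helper: counts letters by Unicode script and reports
    // whichever of Arabic or Latin has more of them.
    static Guess classify(String text) {
        int arabic = 0;
        int latin = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                if (script == Character.UnicodeScript.ARABIC) {
                    arabic++;
                } else if (script == Character.UnicodeScript.LATIN) {
                    latin++;
                }
            }
            i += Character.charCount(cp);
        }
        if (arabic > latin) return Guess.ARABIC;
        if (latin > arabic) return Guess.LATIN;
        // Equal counts, including strings with no letters at all.
        return Guess.UNDETERMINED;
    }
}
```

Unlike the all-or-nothing check, a string with spaces or digits between Arabic words is still classified as Arabic.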


傲寒

2021-01-31 18:16
Here is some simple logic that I just tried:
```
  public static boolean isProbablyArabic(String s) {
    for (int i = 0; i < s.length();) {
        int c = s.codePointAt(i);
        if (c >= 0x0600 && c <= 0x06E0)
            return true;
        i += Character.charCount(c);            
    }
    return false;
  }
```
It declares the text Arabic if and only if an Arabic Unicode code point is found in the text. You can enhance this logic to better suit your needs.

The range 0600 - 06E0 covers most of the Unicode Arabic block, which runs from 0600 to 06FF (see the Unicode tables).
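
To see where the hard-coded range and Java's own Arabic block disagree, a small comparison can help. This is an illustrative sketch with hypothetical names. One example of a difference: the Extended Arabic-Indic digits (U+06F0 to U+06F9), which are used in Persian, sit inside the Arabic block but above 06E0.

```java
final class ArabicRangeDemo {
    // The range used in the snippet above.
    static boolean inAnswerRange(int cp) {
        return cp >= 0x0600 && cp <= 0x06E0;
    }

    // The whole Arabic block (U+0600 to U+06FF) as reported by the JDK.
    static boolean inArabicBlock(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ARABIC;
    }
}
```

Whether the extra code points matter depends on the input; for text that contains only Arabic letters the two checks agree.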