How to determine whether a string is English or Persian?


I have an EditText in a form. When the user types into it, I want my program to detect which language the text was entered in.

Is there a


5 Answers

清酒与你

2021-01-11 18:12
Unicode ranges that cover the Persian (and also Urdu) alphabet:
- 0x0600 to 0x06FF
- 0xFB50 to 0xFDFF
- 0xFE70 to 0xFEFF

So if you don't want to miss any characters, check all of these ranges. Hope this helps.
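
As a rough sketch, the three ranges listed above could be checked like this. The class and method names are made up for illustration, and the check only tells you that Arabic-script characters are present, not which language they belong to.

```java
final class PersianScriptCheck {
    private PersianScriptCheck() {}

    /** True if the character lies in one of the three ranges listed above. */
    static boolean isInArabicScriptRange(char c) {
        return (c >= 0x0600 && c <= 0x06FF)
            || (c >= 0xFB50 && c <= 0xFDFF)
            || (c >= 0xFE70 && c <= 0xFEFF);
    }

    /** True if at least one character of the text falls in those ranges. */
    static boolean containsArabicScript(String text) {
        // All three ranges sit in the Basic Multilingual Plane,
        // so iterating over chars (not code points) is enough here.
        for (int i = 0; i < text.length(); i++) {
            if (isInArabicScriptRange(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

You would call `PersianScriptCheck.containsArabicScript(editText.getText().toString())`, where `editText` stands for your own field.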
礼貌的吻别

2021-01-11 18:14

Why don't you check it when the keyboard pops up? In other words, you can do it by getting the phone's language. Here is the method to use: `Locale.getDefault().getDisplayLanguage();`. A minSDK of 11 is required.
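
A minimal sketch of that idea, comparing the language code instead of the display name so the result does not depend on how the name is localized. The helper class is hypothetical, and keep in mind it reports the device setting rather than the text actually typed.

```java
import java.util.Locale;

final class DeviceLanguage {
    private DeviceLanguage() {}

    /** True if the given locale's language code is Persian ("fa"). */
    static boolean isPersian(Locale locale) {
        return "fa".equals(locale.getLanguage());
    }

    /** Applies the same check to the default locale of the running JVM or device. */
    static boolean isDefaultPersian() {
        return isPersian(Locale.getDefault());
    }
}
```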

隐瞒了意图╮

2021-01-11 18:16

There's no exact way to determine what language your user is typing in unless you get really complicated, which is why the example method you've given is called isProbablyArabic rather than isArabic. If your users write exclusively in English or Farsi and nothing else, one option is to use a regex that checks whether the user's text consists only of Western Roman characters ("^[a-zA-Z]*$"). If it returns false, you can assume they've typed in Persian, though it could be anything that uses a different character set.
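
Something along these lines would do it. This is only a sketch built around the regex quoted above; the class and method names are placeholders.

```java
import java.util.regex.Pattern;

final class RomanLetterCheck {
    // The regex from the answer: zero or more ASCII letters and nothing else.
    private static final Pattern ROMAN_ONLY = Pattern.compile("^[a-zA-Z]*$");

    private RomanLetterCheck() {}

    /** True if the text consists only of the letters a-z and A-Z. */
    static boolean isRomanLettersOnly(String text) {
        return ROMAN_ONLY.matcher(text).matches();
    }

    /** Assumes the only two possibilities are English and Persian. */
    static String guessLanguage(String text) {
        return isRomanLettersOnly(text) ? "English" : "Persian";
    }
}
```

Note that spaces, digits and punctuation also fail this pattern, so "hello world" would be guessed as Persian unless you widen the character class.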

生来不讨喜

2021-01-11 18:19
You can tell whether a string is English or Persian by using a regex.
```
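import java.util.regex.Pattern;

// Matches strings made up only of ASCII letters, digits and underscores.
public static final Pattern VALID_NAME_PATTERN_REGEX = Pattern.compile("^[a-zA-Z_0-9]+$");

public static boolean isEnglishWord(String string) {
    // matches() requires the whole string to fit the pattern, not just its tail.
    return VALID_NAME_PATTERN_REGEX.matcher(string).matches();
}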
```
This only works with letters, digits and underscores. If there is a character like '=' or '+', the function returns false. You can fix that by editing the regex to match what you need.
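
For instance, one possible way to widen the pattern is to accept any ASCII character, which lets spaces, '=' and '+' through. This is just a sketch with made-up names; `\p{ASCII}` is one of the POSIX character classes supported by `java.util.regex.Pattern`.

```java
import java.util.regex.Pattern;

final class AsciiTextCheck {
    // \p{ASCII} covers every character from U+0000 to U+007F,
    // so letters, digits, spaces and punctuation are all accepted.
    private static final Pattern ASCII_ONLY = Pattern.compile("^\\p{ASCII}+$");

    private AsciiTextCheck() {}

    /** True if the text is non-empty and contains only ASCII characters. */
    static boolean isAsciiText(String text) {
        return ASCII_ONLY.matcher(text).matches();
    }
}
```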
我在风中等你

2021-01-11 18:33

Using character ranges is not a perfect way to detect languages whose ranges overlap, e.g., Arabic, Persian and Urdu. But if you insist on this approach, my suggestion is to look for special characters that are language-specific. For example, گ and پ are used in Persian but not in Arabic. On the other hand, ئ and ة may be more common in Arabic text than in Persian. By counting these specific characters you can distinguish between Arabic, Persian and Urdu.
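
Here is a rough sketch of the counting idea, using only the four letters mentioned above. The class name, the marker sets and the simple "whichever count is higher" rule are my own assumptions, and Urdu is left out because no Urdu-specific letters are named here.

```java
final class ArabicPersianGuess {
    // Marker letters taken from the paragraph above (assumed sets, not exhaustive).
    private static final String PERSIAN_MARKERS = "\u06AF\u067E"; // گ and پ
    private static final String ARABIC_MARKERS = "\u0626\u0629";  // ئ and ة

    private ArabicPersianGuess() {}

    /** Counts how many characters of the text appear in the marker set. */
    static int countMarkers(String text, String markers) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (markers.indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    /** Returns "Persian", "Arabic", or "Unknown" when the counts are tied. */
    static String guess(String text) {
        int persian = countMarkers(text, PERSIAN_MARKERS);
        int arabic = countMarkers(text, ARABIC_MARKERS);
        if (persian > arabic) {
            return "Persian";
        }
        if (arabic > persian) {
            return "Arabic";
        }
        return "Unknown";
    }
}
```

Short inputs will often contain none of the marker letters, so expect "Unknown" a lot unless the text is reasonably long.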

Although I've gotten good results from the method above, using n-grams to detect a language is more popular and dependable. There are many libraries that perform language detection this way.
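
To give a feel for how the n-gram approach works, here is a heavily simplified sketch that compares character bigrams of the input against sample texts you supply. It is a toy, not how any particular library does it: real detectors use profiles trained on large corpora, and every name below is made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BigramGuess {
    private BigramGuess() {}

    /** Counts the overlapping two-character sequences in the text. */
    static Map<String, Integer> bigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 2 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of bigram occurrences in the input that also appear in the profile. */
    static int overlap(Map<String, Integer> profile, String input) {
        int score = 0;
        for (Map.Entry<String, Integer> e : bigrams(input).entrySet()) {
            if (profile.containsKey(e.getKey())) {
                score += e.getValue();
            }
        }
        return score;
    }

    /** Returns the label whose sample shares the most bigrams, or "Unknown" if none do. */
    static String guess(Map<String, String> samplesByLabel, String input) {
        String best = "Unknown";
        int bestScore = 0;
        for (Map.Entry<String, String> e : samplesByLabel.entrySet()) {
            int score = overlap(bigrams(e.getValue()), input);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

You would fill `samplesByLabel` with one reasonably long sample text per language (for example "Persian", "Arabic", "Urdu") and pass the user's input to `guess`.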
