How to determine whether a string is English or Persian?

南笙 2021-01-11 17:45

I have an EditText in a form. When the user types text into it, I want my program to detect which language was entered.

Is there a

5 Answers
  • 2021-01-11 18:12

    The Unicode ranges that cover the Persian (and also the Urdu) alphabet:

    • 0x0600 to 0x06FF

    • 0xFB50 to 0xFDFF

    • 0xFE70 to 0xFEFF

    So if you don't want to miss any characters, check all three ranges. Hope this helps.
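    The range check above could be sketched in Java roughly as follows. The class and method names (PersianRanges, isInArabicScriptRange, containsArabicScript) are made up for illustration and are not part of the original answer. Note that these ranges are shared by Arabic, Persian, and Urdu, so a match only indicates Arabic-script text, not Persian specifically.

```java
final class PersianRanges {
    // The three ranges quoted in the answer above: the Arabic block
    // plus Arabic Presentation Forms-A and Forms-B.
    private static final int[][] RANGES = {
        {0x0600, 0x06FF},
        {0xFB50, 0xFDFF},
        {0xFE70, 0xFEFF},
    };

    private PersianRanges() {}

    /** Returns true if the code point falls inside any of the listed ranges. */
    static boolean isInArabicScriptRange(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if at least one code point of the text is in the listed ranges. */
    static boolean containsArabicScript(String text) {
        return text.codePoints().anyMatch(PersianRanges::isInArabicScriptRange);
    }
}
```

    With an EditText, one would presumably call something like PersianRanges.containsArabicScript(editText.getText().toString()) and treat a false result as "not Persian".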

  • 2021-01-11 18:14

    Why not check it when the keyboard pops up? That is, you can do it by getting the language of the phone. The method to use is Locale.getDefault().getDisplayLanguage(); a minSDK of 11 is required.
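    A minimal sketch of the locale approach, assuming it is enough to compare the locale's ISO 639 language code against "fa". The class and method names here (DeviceLanguage, isPersian) are hypothetical. Keep in mind that the default locale reflects the device's configured language, which may differ from the keyboard layout the user is actually typing with.

```java
import java.util.Locale;

final class DeviceLanguage {
    private DeviceLanguage() {}

    /** Returns true if the locale's language code is "fa" (Persian). */
    static boolean isPersian(Locale locale) {
        return "fa".equals(locale.getLanguage());
    }

    /** Convenience wrapper that checks the current default locale. */
    static boolean isDefaultPersian() {
        return isPersian(Locale.getDefault());
    }
}
```

    getLanguage() is used for the comparison instead of getDisplayLanguage() because the display name is itself localized and so is awkward to compare against a fixed string.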

  • 2021-01-11 18:16

    There's no exact way to determine what language your user is typing in unless you get really complicated, which is why the example method you've given is called isProbablyArabic rather than isArabic. If your users write exclusively in English or Farsi and nothing else, one option is a regex that checks whether the text consists only of Western Roman characters ("^[a-zA-Z]*$"). If it returns false, you can assume they've typed in Persian, though it could be anything that uses a different character set.

  • 2021-01-11 18:19

    You can check whether a string is English or Persian by using a regex.

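    import java.util.regex.Pattern;

    // Anchored at both ends: the whole string must be ASCII letters, digits, or '_'.
    public static final Pattern VALID_NAME_PATTERN_REGEX = Pattern.compile("^[a-zA-Z_0-9]+$");

    public static boolean isEnglishWord(String string) {
        // matches() tests the entire input; find() with only a trailing '$'
        // would accept any string that merely ends with a word character.
        return VALID_NAME_PATTERN_REGEX.matcher(string).matches();
    }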
    

    This only works with letters, digits, and underscores. If the string contains a character like '=' or '+', the function returns false. You can change that by editing the regex to match what you need.

  • 2021-01-11 18:33

    Using character ranges is not a perfect way to tell apart languages whose ranges overlap, e.g. Arabic, Persian, and Urdu. But if you insist on this approach, my suggestion is to look for special characters that are language-specific. For example, گ and پ are used in Persian but not in Arabic. On the other hand, ئ and ة may be more common in Arabic text than in Persian. By counting these specific characters you can distinguish between Arabic, Persian, and Urdu.
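    The counting idea above could look something like this in Java. It is only a sketch: the class and method names (ScriptHints, countHints, guess) are invented for illustration, the hint sets contain just the four letters named in this answer, and Urdu is left out because no Urdu-specific letters are listed here. A real classifier would need larger hint sets and probably a minimum-count threshold.

```java
final class ScriptHints {
    // Letters named in the answer, written as escapes to avoid encoding issues:
    // gaf (U+06AF) and pe (U+067E) hint at Persian;
    // yeh with hamza above (U+0626) and teh marbuta (U+0629) hint at Arabic.
    private static final String PERSIAN_HINTS = "\u06AF\u067E";
    private static final String ARABIC_HINTS = "\u0626\u0629";

    private ScriptHints() {}

    /** Counts how many characters of the text appear in the hint set. */
    static long countHints(String text, String hints) {
        return text.chars().filter(c -> hints.indexOf(c) >= 0).count();
    }

    /** Returns "persian", "arabic", or "unknown" when the counts are equal. */
    static String guess(String text) {
        long persian = countHints(text, PERSIAN_HINTS);
        long arabic = countHints(text, ARABIC_HINTS);
        if (persian > arabic) {
            return "persian";
        }
        if (arabic > persian) {
            return "arabic";
        }
        return "unknown";
    }
}
```

    Short inputs often contain none of the hint letters, so "unknown" will be a common result for single words.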

    Although I've got good results from the method above, using n-grams to detect a language is more popular and more dependable. There are many libraries that perform language detection this way.
