PHP: How do I detect if an input string is Arabic?

Submitted by 夙愿已清 on 2019-11-28 05:59:09
The Surrican

Hmm, I may offer an improved version of DimaKrasun's function:

function is_arabic($string) {
    if($string === 'arabic') {
         return true;
    }
    return false;
}

okay, enough joking!

Pekka's suggestion to use the Google Translate API is a good one! But you would be relying on an external service, which always makes things more complicated.

I think Rushyo's approach is good! It's just not that easy. I wrote the following function for you. It's not tested, but it should work...

<?php
function uniord($u) {
    // I just copied this function from the php.net comments, but it should work fine!
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
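    $encoding = mb_detect_encoding($str);
    if($encoding !== false && $encoding !== 'UTF-8') {
        // mb_convert_encoding takes the target encoding first, then the source encoding
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }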

    /*
    $str = str_split($str); <- this function is not multibyte-safe: it splits by bytes, not characters, so we cannot use it
    $str = preg_split('//u',$str); <- this would probably work fine, but a bug was reported in some PHP versions where it splits by bytes and not characters as well
    */
    preg_match_all('/.|\n/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach($chars as $char) {
        // $pos = ord($char); we can't use that: ord() only looks at the first byte of a multibyte character
        $pos = uniord($char);
        echo $char ." --> ".$pos.PHP_EOL;

        if($pos >= 1536 && $pos <= 1791) {
            $arabic_count++;
        } else if($pos > 123 && $pos < 123) { // dummy range (never true as written); put real Latin bounds here
            $latin_count++;
        }
        $total_count++;
    }
    if(($arabic_count/$total_count) > 0.6) {
        // 60% arabic chars, its probably arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع'); 
var_dump($arabic);
?>

Final thoughts: as you can see, I added a Latin counter as an example. Its range is just a dummy number, but this way you could detect other character sets (Hebrew, Latin, Arabic, Hindi, Chinese, etc.).

You may also want to eliminate some characters first, such as @, spaces, line breaks, slashes, etc. The PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful, but because of the bug I didn't use it here.

You could also keep a counter for each character set and see which one occurs the most.
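The per-set counters could be sketched roughly like this. It swaps the uniord() loop for PCRE's Unicode script properties; the function name `dominant_script` and the list of scripts are my own assumptions, and the input is assumed to be valid UTF-8:

```php
<?php
// Sketch: count characters per Unicode script and report the most frequent one.
// The function name and the script list are illustrative assumptions.
function dominant_script(string $text): ?string
{
    $scripts = ['Arabic', 'Hebrew', 'Latin', 'Cyrillic', 'Devanagari', 'Han'];
    $counts = [];
    foreach ($scripts as $script) {
        // preg_match_all returns the number of matches, or false on error.
        $counts[$script] = (int) preg_match_all('/\p{' . $script . '}/u', $text);
    }
    $max = max($counts);
    if ($max === 0) {
        return null; // no character from any listed script was found
    }
    // array_search returns the first script holding the highest count.
    return array_search($max, $counts, true);
}
```

Spaces, digits and most punctuation belong to the Common script, so they are not counted for any entry in the list.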

And finally, you should consider cutting your string off after 200 characters or so. That should be enough to tell which character set is used.

And you have to do some error handling, like division by zero, empty strings, etc. Don't forget that, please! Any questions? Comment!
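Put together, those hardening steps might look roughly like the sketch below. The name `is_mostly_arabic` and the parameters `$threshold` and `$maxChars` are assumptions of mine, and it counts with `\p{Arabic}` instead of the uniord() loop, so treat it as one possible variant rather than a tested replacement:

```php
<?php
// Sketch of the hardening steps; the name and the default values are assumptions.
function is_mostly_arabic(string $text, float $threshold = 0.6, int $maxChars = 200): bool
{
    // Drop whitespace, digits and punctuation so they do not dilute the ratio.
    $clean = preg_replace('/[\s\d\p{P}]+/u', '', $text);
    if ($clean === null || $clean === '') {
        return false; // invalid UTF-8, or nothing left to inspect
    }
    // A short prefix should be enough to judge the script.
    $sample = mb_substr($clean, 0, $maxChars, 'UTF-8');
    $total = mb_strlen($sample, 'UTF-8'); // at least 1 here, so no division by zero
    $arabic = (int) preg_match_all('/\p{Arabic}/u', $sample);
    return ($arabic / $total) > $threshold;
}
```

The early return covers both the empty string and a string that consists only of characters that were stripped out.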

If you want to detect the LANGUAGE of a string, you should split it into words and look those words up in some predefined tables. You don't need a complete dictionary; the most common words are enough and it should work fine. Tokenization and normalization are a must as well! There are libraries for that anyway, and this is not what you asked for :) I just wanted to mention it.
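To give an idea of the word-table approach, here is a minimal sketch. The tables are deliberately tiny and hypothetical (real ones would hold many more frequent words per language), the name `guess_language` is made up, and the only normalization is lowercasing and stripping combining marks:

```php
<?php
// Hypothetical, deliberately tiny word tables for illustration only.
const COMMON_WORDS = [
    'ar' => ['في', 'من', 'على', 'إلى', 'هذا', 'التي'],
    'en' => ['the', 'and', 'of', 'to', 'in', 'is'],
];

function guess_language(string $text): ?string
{
    // Normalization: lowercase, then strip combining marks such as Arabic diacritics.
    $text = preg_replace('/\p{Mn}+/u', '', mb_strtolower($text, 'UTF-8'));
    if ($text === null) {
        return null; // input was not valid UTF-8
    }
    // Tokenization: split on anything that is not a letter.
    $words = preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $best = null;
    $bestHits = 0;
    foreach (COMMON_WORDS as $lang => $table) {
        // Count how many tokens appear in this language's table.
        $hits = count(array_intersect($words, $table));
        if ($hits > $bestHits) {
            $best = $lang;
            $bestHits = $hits;
        }
    }
    return $best; // null when no table word was seen
}
```

With tables this small the result is only a rough hint; a library is the better choice for anything serious.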

This will check whether the string is Arabic or contains Arabic text.

The text must be Unicode, e.g. UTF-8.

$str = "بسم الله";
if (preg_match('/[اأإء-ي]/ui', $str)) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
Dmytro Krasun

You can use this function, which I have written for you:

<?php
/**
 * Returns true if the string starts with an Arabic character.
 *
 * @param string $string
 * @return bool
 */
function is_arabic($string)
{
    // The /u modifier is required so that the subject is read as UTF-8.
    return (preg_match('/^\p{Arabic}/u', $string) > 0);
}

But please test it before use.

[EDIT 1]

Your question was "How do I detect if an input string is Arabic?" and I have answered it. What's wrong?

[EDIT 2]

Read this - Detect language from string in PHP

[EDIT 3]

Sorry, I have rewritten the function as follows, try it:

function is_arabic($subject)
{
    // Code points above FF need the \x{...} form together with the /u modifier.
    return (preg_match('/^[\x{0600}-\x{06FF}]/u', $subject) > 0);
}

I'm not aware of a PHP solution for this, no.

The Google Translate Ajax APIs may be for you, though.

Check out this Javascript snippet from the API docs: Example: Language Detection

I assume you're referring to a Unicode string... in which case, just look for the presence of any character with a code between U+0600–U+06FF (1536–1791) in the string.
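A minimal sketch of that range check, assuming the input is valid UTF-8 (the function name `contains_arabic_block` is made up for illustration):

```php
<?php
// Sketch: true if any code point falls in the Arabic block U+0600 to U+06FF.
// The function name is an assumption; input is assumed to be valid UTF-8.
function contains_arabic_block(string $text): bool
{
    // \x{...} with the /u modifier matches on code points rather than bytes.
    return preg_match('/[\x{0600}-\x{06FF}]/u', $text) === 1;
}
```

Note that this covers only the main Arabic block. Characters in Arabic Supplement (U+0750 to U+077F) or the Arabic Presentation Forms blocks would need extra ranges.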

public static function isArabic($string){
    if(preg_match('/\p{Arabic}/u', $string))
        return true;
    return false;
}

The PHP Text_LanguageDetect library is able to detect 52 languages. It's unit-tested and installable via composer and PEAR.

Use a regular expression for a shorter and easier answer:

 $is_arabic = preg_match('/\p{Arabic}/u', $text);

This will return 1 if the string contains at least one Arabic character and 0 if it contains none.
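To make the return values concrete, here is a small sketch. The sample strings and the stricter "only Arabic and whitespace" pattern are my own illustration, not part of the answer above:

```php
<?php
$mixed = 'Hello مرحبا';

// 1: the string contains at least one Arabic character.
$contains = preg_match('/\p{Arabic}/u', $mixed);

// 0: the string is not made up solely of Arabic characters and whitespace.
$only = preg_match('/^[\p{Arabic}\s]+$/u', $mixed);

// false: with /u, a subject that is not valid UTF-8 is an error, not a non-match.
$invalid = preg_match('/\p{Arabic}/u', "\xFF");
```

So a strict comparison such as `=== 1` is safer than relying on truthiness, since both 0 and false are falsy.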

This function checks whether the entered line/sentence is Arabic or not. It cleans and trims the input first, then checks it word by word, counting the Arabic and non-Arabic words.

function isArabic($string){
        // Initializing count variables with zero
        $arabicCount = 0;
        $englishCount = 0;
        // Getting the cleanest String without any number or Brackets or Hyphen
        $noNumbers = preg_replace('/[0-9]+/', '', $string);
        $noBracketsHyphen = array('(', ')', '-');
        $clean = trim(str_replace($noBracketsHyphen , '', $noNumbers));
        // After Getting the clean string, splitting it by space to get the total entered words 
        $array = explode(" ", $clean); // $array contain the words that was entered by the user
        for ($i = 0; $i < count($array); $i++) {
            // Checking either word is Arabic or not
            $checkLang = preg_match('/\p{Arabic}/u', $array[$i]);
            if($checkLang == 1){
                ++$arabicCount;
            } else{
                ++$englishCount;
            }
        }
        if($arabicCount >= $englishCount){
            // Return 1 means TRUE i-e Arabic
            return 1;
        } else{
            // Return 0 means FALSE i-e English
            return 0;
        }
    }

I would use regular expressions to count the Arabic characters and compare that count to the total length of the string. If, for instance, more than 60% of the text is Arabic characters, I would consider it mainly Arabic and apply RTL formatting.

/**
 * Is the given text mainly Arabic language? 
 *
 * @param string $text string to be tested if it is arabic. :-)
 * @return bool 
 */
function ct_is_arabic_text($text) {
    $text = preg_replace('/[ 0-9\(\)\.\,\-\:\n\r_]/', '', $text); // Remove spaces, numbers, punctuation.
    $total_count = mb_strlen($text); // Length of text
    if ($total_count==0)
        return false;
    $arabic_count = preg_match_all("/[اأإء-ي]/ui", $text, $matches); // Number of Arabic characters
    if(($arabic_count/$total_count) > 0.6) { // More than 60% Arabic characters: it's probably Arabic
        return true;
    }
    return false;
}

For inline RTL formatting, use CSS. Example class:

.embed-rtl {
 direction: rtl;
 unicode-bidi: embed;
 text-align: right;
}
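To tie detection and the CSS class together, something like the following could work. The helper `wrap_rtl` is hypothetical, and it uses a simple "contains Arabic" test instead of the 60% ratio to stay self-contained:

```php
<?php
// Hypothetical helper: wrap text in a span carrying the embed-rtl class
// when it contains Arabic characters. The output is HTML-escaped either way.
function wrap_rtl(string $text): string
{
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    if (preg_match('/\p{Arabic}/u', $text) === 1) {
        return '<span class="embed-rtl">' . $safe . '</span>';
    }
    return $safe;
}
```

Swapping the condition for a ratio-based check would keep mostly-Latin text with a single Arabic word from being flipped to RTL.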