Replacing accented characters in PHP

鱼传尺愫 2020-11-22 16:03

I am trying to replace accented characters with the normal replacements. Below is what I am currently doing.

    $string = "Éric Cantona";
    $strict = st         


        
19 Answers
  • 2020-11-22 16:58

    This worked for me:

    <?php
    setlocale(LC_ALL, "en_US.utf8"); 
    $val = iconv('UTF-8','ASCII//TRANSLIT',$val);
    ?>
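
    The snippet above depends on the `en_US.utf8` locale being installed, and `iconv()` returns `false` when it cannot convert the input. A minimal sketch of a more defensive wrapper follows. The helper name `to_ascii()`, the list of locale names it tries, and the choice to touch only `LC_CTYPE` (on the assumption that this is the category the transliteration consults) are illustrative, not part of the answer. The transliterated result also varies between iconv implementations, so no expected output is shown for accented input.

```php
<?php
// Hypothetical helper: to_ascii() is an illustrative name, not from the answer.
function to_ascii(string $text): string
{
    // Remember the current LC_CTYPE locale so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // Try a few common UTF-8 locale names (assumed; adjust to what is installed).
    // setlocale() uses the first one that exists and returns false if none do.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // @ silences the notice iconv() raises for characters it cannot convert.
    $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if (is_string($previous)) {
        setlocale(LC_CTYPE, $previous);
    }

    // On failure iconv() returns false; fall back to the unchanged input.
    return $converted === false ? $text : $converted;
}

echo to_ascii("Éric Cantona"), "\n"; // exact result depends on the iconv implementation
```

    Restoring the previous locale keeps the change from leaking into unrelated code such as number or date formatting.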
    
  • 2020-11-22 17:04

    As an alternative (a bit more complex, though), have a look at how WordPress does accent removal. I made some changes below so that it runs independently, without referencing other WordPress functions.

    function mbstring_binary_safe_encoding($reset = false)
    {
        static $encodings  = array();
        static $overloaded = null;
    
        if (is_null($overloaded)) {
            $overloaded = function_exists('mb_internal_encoding') && (ini_get('mbstring.func_overload') & 2);
        }
    
        if (false === $overloaded) {
            return;
        }
    
        if (!$reset) {
            $encoding = mb_internal_encoding();
            array_push($encodings, $encoding);
            mb_internal_encoding('ISO-8859-1');
        }
    
        if ($reset && $encodings) {
            $encoding = array_pop($encodings);
            mb_internal_encoding($encoding);
        }
    }
    
    function seems_utf8($str)
    {
        mbstring_binary_safe_encoding();
        $length = strlen($str);
        mbstring_binary_safe_encoding(true);
        for ($i = 0; $i < $length; $i++) {
            $c = ord($str[$i]);
            if ($c < 0x80) {
                $n = 0; // 0bbbbbbb
            } elseif (($c & 0xE0) == 0xC0) {
                $n = 1; // 110bbbbb
            } elseif (($c & 0xF0) == 0xE0) {
                $n = 2; // 1110bbbb
            } elseif (($c & 0xF8) == 0xF0) {
                $n = 3; // 11110bbb
            } elseif (($c & 0xFC) == 0xF8) {
                $n = 4; // 111110bb
            } elseif (($c & 0xFE) == 0xFC) {
                $n = 5; // 1111110b
            } else {
                return false; // Does not match any model
            }
            for ($j = 0; $j < $n; $j++) {
                // n bytes matching 10bbbbbb follow ?
                if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                    return false;
                }
            }
        }
        return true;
    }
    
        function remove_accents($string)
        {
            if (!preg_match('/[\x80-\xff]/', $string)) {
                return $string;
            }
    
            if (seems_utf8($string)) {
                $chars = array(
                    // Decompositions for Latin-1 Supplement
                    'ª' => 'a', 'º'  => 'o',
                    'À' => 'A', 'Á'  => 'A',
                    'Â' => 'A', 'Ã'  => 'A',
                    'Ä' => 'A', 'Å'  => 'A',
                    'Æ' => 'AE', 'Ç' => 'C',
                    'È' => 'E', 'É'  => 'E',
                    'Ê' => 'E', 'Ë'  => 'E',
                    'Ì' => 'I', 'Í'  => 'I',
                    'Î' => 'I', 'Ï'  => 'I',
                    'Ð' => 'D', 'Ñ'  => 'N',
                    'Ò' => 'O', 'Ó'  => 'O',
                    'Ô' => 'O', 'Õ'  => 'O',
                    'Ö' => 'O', 'Ù'  => 'U',
                    'Ú' => 'U', 'Û'  => 'U',
                    'Ü' => 'U', 'Ý'  => 'Y',
                    'Þ' => 'TH', 'ß' => 's',
                    'à' => 'a', 'á'  => 'a',
                    'â' => 'a', 'ã'  => 'a',
                    'ä' => 'a', 'å'  => 'a',
                    'æ' => 'ae', 'ç' => 'c',
                    'è' => 'e', 'é'  => 'e',
                    'ê' => 'e', 'ë'  => 'e',
                    'ì' => 'i', 'í'  => 'i',
                    'î' => 'i', 'ï'  => 'i',
                    'ð' => 'd', 'ñ'  => 'n',
                    'ò' => 'o', 'ó'  => 'o',
                    'ô' => 'o', 'õ'  => 'o',
                    'ö' => 'o', 'ø'  => 'o',
                    'ù' => 'u', 'ú'  => 'u',
                    'û' => 'u', 'ü'  => 'u',
                    'ý' => 'y', 'þ'  => 'th',
                    'ÿ' => 'y', 'Ø'  => 'O',
                    // Decompositions for Latin Extended-A
                    'Ā' => 'A', 'ā'  => 'a',
                    'Ă' => 'A', 'ă'  => 'a',
                    'Ą' => 'A', 'ą'  => 'a',
                    'Ć' => 'C', 'ć'  => 'c',
                    'Ĉ' => 'C', 'ĉ'  => 'c',
                    'Ċ' => 'C', 'ċ'  => 'c',
                    'Č' => 'C', 'č'  => 'c',
                    'Ď' => 'D', 'ď'  => 'd',
                    'Đ' => 'D', 'đ'  => 'd',
                    'Ē' => 'E', 'ē'  => 'e',
                    'Ĕ' => 'E', 'ĕ'  => 'e',
                    'Ė' => 'E', 'ė'  => 'e',
                    'Ę' => 'E', 'ę'  => 'e',
                    'Ě' => 'E', 'ě'  => 'e',
                    'Ĝ' => 'G', 'ĝ'  => 'g',
                    'Ğ' => 'G', 'ğ'  => 'g',
                    'Ġ' => 'G', 'ġ'  => 'g',
                    'Ģ' => 'G', 'ģ'  => 'g',
                    'Ĥ' => 'H', 'ĥ'  => 'h',
                    'Ħ' => 'H', 'ħ'  => 'h',
                    'Ĩ' => 'I', 'ĩ'  => 'i',
                    'Ī' => 'I', 'ī'  => 'i',
                    'Ĭ' => 'I', 'ĭ'  => 'i',
                    'Į' => 'I', 'į'  => 'i',
                    'İ' => 'I', 'ı'  => 'i',
                    'Ĳ' => 'IJ', 'ĳ' => 'ij',
                    'Ĵ' => 'J', 'ĵ'  => 'j',
                    'Ķ' => 'K', 'ķ'  => 'k',
                    'ĸ' => 'k', 'Ĺ'  => 'L',
                    'ĺ' => 'l', 'Ļ'  => 'L',
                    'ļ' => 'l', 'Ľ'  => 'L',
                    'ľ' => 'l', 'Ŀ'  => 'L',
                    'ŀ' => 'l', 'Ł'  => 'L',
                    'ł' => 'l', 'Ń'  => 'N',
                    'ń' => 'n', 'Ņ'  => 'N',
                    'ņ' => 'n', 'Ň'  => 'N',
                    'ň' => 'n', 'ŉ'  => 'n',
                    'Ŋ' => 'N', 'ŋ'  => 'n',
                    'Ō' => 'O', 'ō'  => 'o',
                    'Ŏ' => 'O', 'ŏ'  => 'o',
                    'Ő' => 'O', 'ő'  => 'o',
                    'Œ' => 'OE', 'œ' => 'oe',
                    'Ŕ' => 'R', 'ŕ'  => 'r',
                    'Ŗ' => 'R', 'ŗ'  => 'r',
                    'Ř' => 'R', 'ř'  => 'r',
                    'Ś' => 'S', 'ś'  => 's',
                    'Ŝ' => 'S', 'ŝ'  => 's',
                    'Ş' => 'S', 'ş'  => 's',
                    'Š' => 'S', 'š'  => 's',
                    'Ţ' => 'T', 'ţ'  => 't',
                    'Ť' => 'T', 'ť'  => 't',
                    'Ŧ' => 'T', 'ŧ'  => 't',
                    'Ũ' => 'U', 'ũ'  => 'u',
                    'Ū' => 'U', 'ū'  => 'u',
                    'Ŭ' => 'U', 'ŭ'  => 'u',
                    'Ů' => 'U', 'ů'  => 'u',
                    'Ű' => 'U', 'ű'  => 'u',
                    'Ų' => 'U', 'ų'  => 'u',
                    'Ŵ' => 'W', 'ŵ'  => 'w',
                    'Ŷ' => 'Y', 'ŷ'  => 'y',
                    'Ÿ' => 'Y', 'Ź'  => 'Z',
                    'ź' => 'z', 'Ż'  => 'Z',
                    'ż' => 'z', 'Ž'  => 'Z',
                    'ž' => 'z', 'ſ'  => 's',
                    // Decompositions for Latin Extended-B
                    'Ș' => 'S', 'ș'  => 's',
                    'Ț' => 'T', 'ț'  => 't',
                    // Euro Sign
                    '€' => 'E',
                    // GBP (Pound) Sign
                    '£' => '',
                    // Vowels with diacritic (Vietnamese)
                    // unmarked
                    'Ơ' => 'O', 'ơ'  => 'o',
                    'Ư' => 'U', 'ư'  => 'u',
                    // grave accent
                    'Ầ' => 'A', 'ầ'  => 'a',
                    'Ằ' => 'A', 'ằ'  => 'a',
                    'Ề' => 'E', 'ề'  => 'e',
                    'Ồ' => 'O', 'ồ'  => 'o',
                    'Ờ' => 'O', 'ờ'  => 'o',
                    'Ừ' => 'U', 'ừ'  => 'u',
                    'Ỳ' => 'Y', 'ỳ'  => 'y',
                    // hook
                    'Ả' => 'A', 'ả'  => 'a',
                    'Ẩ' => 'A', 'ẩ'  => 'a',
                    'Ẳ' => 'A', 'ẳ'  => 'a',
                    'Ẻ' => 'E', 'ẻ'  => 'e',
                    'Ể' => 'E', 'ể'  => 'e',
                    'Ỉ' => 'I', 'ỉ'  => 'i',
                    'Ỏ' => 'O', 'ỏ'  => 'o',
                    'Ổ' => 'O', 'ổ'  => 'o',
                    'Ở' => 'O', 'ở'  => 'o',
                    'Ủ' => 'U', 'ủ'  => 'u',
                    'Ử' => 'U', 'ử'  => 'u',
                    'Ỷ' => 'Y', 'ỷ'  => 'y',
                    // tilde
                    'Ẫ' => 'A', 'ẫ'  => 'a',
                    'Ẵ' => 'A', 'ẵ'  => 'a',
                    'Ẽ' => 'E', 'ẽ'  => 'e',
                    'Ễ' => 'E', 'ễ'  => 'e',
                    'Ỗ' => 'O', 'ỗ'  => 'o',
                    'Ỡ' => 'O', 'ỡ'  => 'o',
                    'Ữ' => 'U', 'ữ'  => 'u',
                    'Ỹ' => 'Y', 'ỹ'  => 'y',
                    // acute accent
                    'Ấ' => 'A', 'ấ'  => 'a',
                    'Ắ' => 'A', 'ắ'  => 'a',
                    'Ế' => 'E', 'ế'  => 'e',
                    'Ố' => 'O', 'ố'  => 'o',
                    'Ớ' => 'O', 'ớ'  => 'o',
                    'Ứ' => 'U', 'ứ'  => 'u',
                    // dot below
                    'Ạ' => 'A', 'ạ'  => 'a',
                    'Ậ' => 'A', 'ậ'  => 'a',
                    'Ặ' => 'A', 'ặ'  => 'a',
                    'Ẹ' => 'E', 'ẹ'  => 'e',
                    'Ệ' => 'E', 'ệ'  => 'e',
                    'Ị' => 'I', 'ị'  => 'i',
                    'Ọ' => 'O', 'ọ'  => 'o',
                    'Ộ' => 'O', 'ộ'  => 'o',
                    'Ợ' => 'O', 'ợ'  => 'o',
                    'Ụ' => 'U', 'ụ'  => 'u',
                    'Ự' => 'U', 'ự'  => 'u',
                    'Ỵ' => 'Y', 'ỵ'  => 'y',
                    // Vowels with diacritic (Chinese, Hanyu Pinyin)
                    'ɑ' => 'a',
                    // macron
                    'Ǖ' => 'U', 'ǖ'  => 'u',
                    // acute accent
                    'Ǘ' => 'U', 'ǘ'  => 'u',
                    // caron
                    'Ǎ' => 'A', 'ǎ'  => 'a',
                    'Ǐ' => 'I', 'ǐ'  => 'i',
                    'Ǒ' => 'O', 'ǒ'  => 'o',
                    'Ǔ' => 'U', 'ǔ'  => 'u',
                    'Ǚ' => 'U', 'ǚ'  => 'u',
                    // grave accent
                    'Ǜ' => 'U', 'ǜ'  => 'u',
                );
    
                $string = strtr($string, $chars);
            } else {
                $chars = array();
                // Assume ISO-8859-1 if not UTF-8
                $chars['in'] = "\x80\x83\x8a\x8e\x9a\x9e"
                    . "\x9f\xa2\xa5\xb5\xc0\xc1\xc2"
                    . "\xc3\xc4\xc5\xc7\xc8\xc9\xca"
                    . "\xcb\xcc\xcd\xce\xcf\xd1\xd2"
                    . "\xd3\xd4\xd5\xd6\xd8\xd9\xda"
                    . "\xdb\xdc\xdd\xe0\xe1\xe2\xe3"
                    . "\xe4\xe5\xe7\xe8\xe9\xea\xeb"
                    . "\xec\xed\xee\xef\xf1\xf2\xf3"
                    . "\xf4\xf5\xf6\xf8\xf9\xfa\xfb"
                    . "\xfc\xfd\xff";
    
                $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
    
                $string              = strtr($string, $chars['in'], $chars['out']);
                $double_chars        = array();
                $double_chars['in']  = array("\x8c", "\x9c", "\xc6", "\xd0", "\xde", "\xdf", "\xe6", "\xf0", "\xfe");
                $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
                $string              = str_replace($double_chars['in'], $double_chars['out'], $string);
            }
    
            return $string;
        }
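
    `seems_utf8()` above is a structural check that also accepts the obsolete 5- and 6-byte forms. As an alternative sketch (a different technique from the code above, named plainly as such), PCRE's `u` modifier can validate a string. `is_valid_utf8()` is a hypothetical name.

```php
<?php
// Hypothetical helper, shown as an alternative to seems_utf8().
function is_valid_utf8(string $str): bool
{
    // With the "u" modifier, preg_match() returns false when the subject
    // is not valid UTF-8; the empty pattern matches any valid string.
    return @preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("\u{00C9}ric")); // UTF-8 encoded "Éric": bool(true)
var_dump(is_valid_utf8("\xC9ric"));     // ISO-8859-1 encoded "Éric": bool(false)
```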
    
  • 2020-11-22 17:05

    I know this question was asked a long, long time ago...

    I was looking for a short and elegant solution, but was not satisfied with what I found, for two reasons:

    First, most of the existing solutions replace a list of characters with a list of other characters. Unfortunately, this requires the PHP script file itself to be saved in a specific encoding, which might be unwanted.

    Second, iconv seems like a good approach, but it is not enough on its own: a converted character can come back as one character, as two or more characters, or the conversion can fail with an error.

    So I wrote this small function, which does the job:

    function replaceAccent($string, $replacement = '_')
    {
        $alnumPattern = '/^[a-zA-Z0-9 ]+$/';
    
        if (preg_match($alnumPattern, $string)) {
            return $string;
        }
    
        $ret = array_map(
            function ($chr) use ($alnumPattern, $replacement) {
                if (preg_match($alnumPattern, $chr)) {
                    return $chr;
                } else {
                    $chr = @iconv('ISO-8859-1', 'ASCII//TRANSLIT', $chr);
                    if (strlen($chr) == 1) {
                        return $chr;
                    } elseif (strlen($chr) > 1) {
                        $ret = '';
                        foreach (str_split($chr) as $char2) {
                            if (preg_match($alnumPattern, $char2)) {
                                $ret .= $char2;
                            }
                        }
                        return $ret;
                    } else {
                        // replace whatever iconv fails to convert with something else
                        return $replacement;
                    }
                }
            },
            str_split($string)
        );
    
        return implode($ret);
    }
    
  • 2020-11-22 17:07

    You can use PHP's strtr() function to get rid of accented characters:

    $string = "Éric Cantona";
    $accented_array = array('Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E','Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U','Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c','è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o','ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
    
    $required_str = strtr( $string, $accented_array );
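
    The array form of `strtr()` matters here. A short sketch of the difference, assuming UTF-8 input: with an array, `strtr()` replaces whole substrings, so multi-byte characters are handled. The three-argument form translates byte by byte, which corrupts multi-byte characters. The `\u{...}` escapes (PHP 7+) are used only so that the example does not depend on the encoding of the script file.

```php
<?php
$input = "\u{00C9}ric Cantona"; // "Éric Cantona" in UTF-8

// Array form: each key is matched as a whole (multi-byte) substring.
$ok = strtr($input, ["\u{00C9}" => 'E']);
var_dump($ok === 'Eric Cantona'); // bool(true)

// Three-argument form: works on single bytes, and the longer argument is
// truncated, so only the first byte of "É" (0xC3) is translated.
$broken = strtr($input, "\u{00C9}", 'E');
var_dump($broken === "E\x89ric Cantona"); // bool(true): a stray 0x89 byte remains
```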
    
  • 2020-11-22 17:09

    An updated answer based on @BurninLeo's answer

    function replace_spec_char($subject) {
        $char_map = array(
            "ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
            "А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
            "Б" => "B", "ב" => "B", "Þ" => "B",
            "Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
            "Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
            "È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
            "Ф" => "F", "Ƒ" => "F",
            "Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
            "ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
            "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
            "Й" => "J", "Ĵ" => "J",
            "ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
            "Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
            "מ" => "M", "М" => "M", "ם" => "M",
            "Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ʼn" => "N", "Ň" => "N",
            "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
            "פ" => "P", "ף" => "P", "П" => "P",
            "ק" => "Q",
            "Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
            "Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
            "Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
            "Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
            "В" => "V", "ו" => "V",
            "Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
            "Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
            "а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
            "б" => "b", "ב" => "b", "þ" => "b",
            "ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
            "Ч" => "ch", "ч" => "ch",
            "д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
            "è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
            "ф" => "f", "ƒ" => "f",
            "ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
            "ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
            "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
            "й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
            "ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
            "ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
            "מ" => "m", "м" => "m", "ם" => "m",
            "ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ʼn" => "n", "ň" => "n",
            "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
            "פ" => "p", "ף" => "p", "п" => "p",
            "ק" => "q",
            "ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
            "ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
            "т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
            "ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
            "в" => "v", "ו" => "v",
            "ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
            "ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
            "™" => "tm",
            "@" => "at",
            "Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
            "ij" => "ij", "IJ" => "ij",
            "я" => "ja", "Я" => "ja",
            "Э" => "je", "э" => "je",
            "ё" => "jo", "Ё" => "jo",
            "ю" => "ju", "Ю" => "ju",
            "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
            "щ" => "sch", "Щ" => "sch",
            "ш" => "sh", "Ш" => "sh",
            "ß" => "ss",
            "Ü" => "ue",
            "Ж" => "zh", "ж" => "zh",
        );
        return strtr($subject, $char_map);
    }
    
    $string = "Ħí ŧħə®ë, юßť å test!";
    echo replace_spec_char($string);
    

    Ħí ŧħə®ë, юßť å test! => Hi there, jusst a test!

    This does not mix up upper- and lower-case characters, except for multi-letter replacements (e.g. ss, ch, sch). It also adds @, ® and ©.

    Also, if you want to build a regex that matches regardless of special characters:

    rss => '[rŕřŘŗŖרŔРр](?:[sșсŜšśסşСŝ][sșсŜšśסşСŝ]|[ß])'
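
    Such a pattern can be generated rather than written by hand. A minimal sketch, assuming a variants list in the same shape as the base list in this answer. The function name `accent_insensitive_pattern()` and the tiny `$variants` sample are hypothetical, and multi-letter mappings such as `ss` for `ß` are not handled.

```php
<?php
// Hypothetical helper: turns a plain word into a regex with one character
// class per letter, listing the letter plus its accented variants.
function accent_insensitive_pattern(string $word, array $variants): string
{
    $pattern = '';
    // Split on code points (not bytes) so multi-byte letters stay intact.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $class = $char . ($variants[$char] ?? '');
        $pattern .= '[' . preg_quote($class, '/') . ']';
    }
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    return '/' . $pattern . '/u';
}

// Small sample in the same "ascii => variants" shape as the base list.
$variants = ['r' => 'ŕřŗ', 'e' => 'èéêë', 's' => 'şśšŝ'];
$pattern  = accent_insensitive_pattern('res', $variants);

var_dump(preg_match($pattern, 'řéš')); // int(1)
var_dump(preg_match($pattern, 'rxs')); // int(0)
```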

    A Vala implementation of this: https://code.launchpad.net/~jeremy-munsch/synapse-project/ascii-smart/+merge/277477

    Here is the base list to work from. With regex replacement (in Sublime Text) or a small script, you can build whatever you need from this array.

    "-" => "ъьЪЬ",
    "A" => "АĂǍĄÀÃÁÆÂÅǺĀא",
    "B" => "БבÞ",
    "C" => "ĈĆÇЦצĊČ©ץ",
    "D" => "ДĎĐדÐ",
    "E" => "ÈĘÉËÊЕĒĖĚĔЄƏע",
    "F" => "ФƑ",
    "G" => "ĞĠĢĜГגҐ",
    "H" => "חĦХĤה",
    "I" => "IÏÎÍÌĮĬIИĨǏיЇĪІ",
    "J" => "ЙĴ",
    "K" => "ĸכĶКך",
    "L" => "ŁĿЛĻĹĽל",
    "M" => "מМם",
    "N" => "ÑŃНŅןŊנʼnŇ",
    "O" => "ØÓÒÔÕОŐŎŌǾǑƠ",
    "P" => "פףП",
    "Q" => "ק",
    "R" => "ŔŘŖרР®",
    "S" => "ŞŚȘŠСŜס",
    "T" => "ТȚטŦתŤŢ",
    "U" => "ÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ",
    "V" => "Вו",
    "Y" => "ÝЫŶŸ",
    "Z" => "ŹŽŻЗז",
    "a" => "аăǎąàãáæâåǻāא",
    "b" => "бבþ",
    "c" => "ĉćçцצċč©ץ",
    "ch" => "ч",
    "d" => "дďđדð",
    "e" => "èęéëêеēėěĕєəע",
    "f" => "фƒ",
    "g" => "ğġģĝгגґ",
    "h" => "חħхĥה",
    "i" => "iïîíìįĭıиĩǐיїīі",
    "j" => "йĵ",
    "k" => "ĸכķкך",
    "l" => "łŀлļĺľל",
    "m" => "מмם",
    "n" => "ñńнņןŋנʼnň",
    "o" => "øóòôõоőŏōǿǒơ",
    "p" => "פףп",
    "q" => "ק",
    "r" => "ŕřŗרр®",
    "s" => "şśșšсŝס",
    "t" => "тțטŧתťţ",
    "u" => "ùûúūуũưǔųŭůűǖǜǚǘ",
    "v" => "вו",
    "y" => "ýыŷÿ",
    "z" => "źžżзזſ",
    "tm" => "™",
    "at" => "@",
    "ae" => "ÄǼäæǽ",
    "ch" => "Чч",
    "ij" => "ijIJ",
    "j" => "йЙĴĵ",
    "ja" => "яЯ",
    "je" => "Ээ",
    "jo" => "ёЁ",
    "ju" => "юЮ",
    "oe" => "œŒöÖ",
    "sch" => "щЩ",
    "sh" => "шШ",
    "ss" => "ß",
    "tm" => "™",
    "ue" => "Ü",
    "zh" => "Жж"
    
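    As one example of such a small script, the sketch below inverts a list in this `replacement => characters` shape into the per-character map that `strtr()` expects. The name `build_char_map()` and the shortened sample list are assumptions for illustration. Note that the list above repeats some keys (`"ch"`, `"j"`, `"tm"`), and in a PHP array literal a later duplicate key replaces the earlier one, so such entries would need merging first.

```php
<?php
// Hypothetical helper: inverts ["A" => "ÀÁ", ...] into ["À" => "A", "Á" => "A", ...].
function build_char_map(array $baseList): array
{
    $map = [];
    foreach ($baseList as $replacement => $characters) {
        // Split on code points so each multi-byte character becomes one key.
        foreach (preg_split('//u', $characters, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            // A character listed under several replacements keeps the last one.
            $map[$char] = (string) $replacement;
        }
    }
    return $map;
}

// Shortened sample; the full base list would be passed in the same way.
$baseList = ['A' => 'ÀÁÂ', 'e' => 'èéêë', 'ss' => 'ß'];
$charMap  = build_char_map($baseList);

echo strtr('Àllé Straße', $charMap), "\n"; // Alle Strasse
```
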
  • 2020-11-22 17:09

    Disclaimer: I'm not supporting this answer anymore (I was blind at that time). But thanks for the up-votes =P

    You can take this as a basis. It comes from WordPress, where it is used to generate pretty URLs (the entry point is the slugify() function):

    /**
     * Converts all accent characters to ASCII characters.
     *
     * If there are no accent characters, then the string given is just returned.
     *
     * @param string $string Text that might have accent characters
     * @return string Filtered string with replaced "nice" characters.
     */
    
    function remove_accents($string) {
     if (!preg_match('/[\x80-\xff]/', $string))
      return $string;
     if (seems_utf8($string)) {
      $chars = array(
      // Decompositions for Latin-1 Supplement
      chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
      chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
      chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
      chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
      chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
      chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
      chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
      chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
      chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
      chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
      chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
      chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
      chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
      chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
      chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
      chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
      chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
      chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
      chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
      chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
      chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
      chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
      chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
      chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
      chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
      chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
      chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
      chr(195).chr(191) => 'y',
      // Decompositions for Latin Extended-A
      chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
      chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
      chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
      chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
      chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
      chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
      chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
      chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
      chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
      chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
      chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
      chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
      chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
      chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
      chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
      chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
      chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
      chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
      chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
      chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
      chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
      chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
      chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
      chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
      chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
      chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
      chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
      chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
      chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
      chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
      chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
      chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
      chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
      chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
      chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
      chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
      chr(197).chr(136) => 'n', chr(197).chr(137) => 'n',
      chr(197).chr(138) => 'N', chr(197).chr(139) => 'n',
      chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
      chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
      chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
      chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
      chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
      chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
      chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
      chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
      chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
      chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
      chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
      chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
      chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
      chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
      chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
      chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
      chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
      chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
      chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
      chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
      chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
      chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
      chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
      chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
      chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
      chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
      // Euro Sign
      chr(226).chr(130).chr(172) => 'E',
      // GBP (Pound) Sign
      chr(194).chr(163) => '');
      $string = strtr($string, $chars);
     } else {
      // Assume ISO-8859-1 if not UTF-8
      $chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
       .chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
       .chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
       .chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
       .chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
       .chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
       .chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
       .chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
       .chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
       .chr(252).chr(253).chr(255);
      $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
      $string = strtr($string, $chars['in'], $chars['out']);
      $double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
      $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
      $string = str_replace($double_chars['in'], $double_chars['out'], $string);
     }
     return $string;
    }
    
    /**
     * Checks to see if a string is utf8 encoded.
     *
     * @author bmorel at ssi dot fr
     *
     * @param string $Str The string to be checked
     * @return bool True if $Str fits a UTF-8 model, false otherwise.
     */
    function seems_utf8($Str) { # by bmorel at ssi dot fr
     $length = strlen($Str);
     for ($i = 0; $i < $length; $i++) {
      if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
      elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1; # 110bbbbb
      elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2; # 1110bbbb
      elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3; # 11110bbb
      elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4; # 111110bb
      elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5; # 1111110b
      else return false; # Does not match any model
      for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
       if ((++$i == $length) || ((ord($Str[$i]) & 0xC0) != 0x80))
       return false;
      }
     }
     return true;
    }
    
    function utf8_uri_encode($utf8_string, $length = 0) {
     $unicode = '';
     $values = array();
     $num_octets = 1;
     $unicode_length = 0;
     $string_length = strlen($utf8_string);
     for ($i = 0; $i < $string_length; $i++) {
      $value = ord($utf8_string[$i]);
      if ($value < 128) {
       if ($length && ($unicode_length >= $length))
        break;
       $unicode .= chr($value);
       $unicode_length++;
      } else {
       if (count($values) == 0) $num_octets = ($value < 224) ? 2 : 3;
       $values[] = $value;
       if ($length && ($unicode_length + ($num_octets * 3)) > $length)
        break;
       if (count( $values ) == $num_octets) {
        if ($num_octets == 3) {
         $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
         $unicode_length += 9;
        } else {
         $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
         $unicode_length += 6;
        }
        $values = array();
        $num_octets = 1;
       }
      }
     }
     return $unicode;
    }
    
    /**
     * Sanitizes title, replacing whitespace with dashes.
     *
     * Limits the output to alphanumeric characters, underscore (_) and dash (-).
     * Whitespace becomes a dash.
     *
     * @param string $title The title to be sanitized.
     * @return string The sanitized title.
     */
    function slugify($title) {
     $title = strip_tags($title);
     // Preserve escaped octets.
     $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
     // Remove percent signs that are not part of an octet.
     $title = str_replace('%', '', $title);
     // Restore octets.
     $title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
     $title = remove_accents($title);
     if (seems_utf8($title)) {
      if (function_exists('mb_strtolower')) {
       $title = mb_strtolower($title, 'UTF-8');
      }
      $title = utf8_uri_encode($title, 200);
     }
     $title = strtolower($title);
     $title = preg_replace('/&.+?;/', '', $title); // kill entities
     $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
     $title = preg_replace('/\s+/', '-', $title);
     $title = preg_replace('|-+|', '-', $title);
     $title = trim($title, '-');
     return $title;
    }
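
    The three octet-handling steps at the top of `slugify()` can be tried in isolation. A small sketch, where `keep_escaped_octets()` is a hypothetical name wrapping the same three calls:

```php
<?php
// Hypothetical wrapper around the three octet-handling lines from slugify().
function keep_escaped_octets(string $title): string
{
    // Temporarily mark %XX octets so they survive the next step.
    $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
    // Drop percent signs that are not part of an octet.
    $title = str_replace('%', '', $title);
    // Turn the markers back into %XX octets.
    return preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
}

echo keep_escaped_octets('100% caf%C3%A9'), "\n"; // 100 caf%C3%A9
```

    One consequence of the marker approach is that a title which already contains text like `---AB---` would be turned into `%AB`.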
    