I'm attempting to remove accents from characters in PHP string as the first step to making the string usable in a URL.
I'm using the following code:
$input = "Fóø Bår";
setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
print($output);
The output I would expect would be something like this:
F'oo Bar
However, instead of the accented characters being transliterated they are replaced with question marks:
F?? B?r
Everything I can find online indicates that setting the locale will fix this problem, however I'm already doing this. I've already checked the following details:
- The locale I am setting is supported by the server (included in the list produced by
locale -a
) - The source and target encodings (UTF-8 and ASCII) are supported by the server's version of iconv (included in the list produced by
iconv -l
) - The input string is UTF-8 encoded (verified using PHP's
mb_check_encoding
function, as suggested in the answer by mercator) - The call to
setlocale
is successful (it returns'en_US.utf8'
rather thanFALSE
)
The cause of the problem:
The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.
Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.
– PHP manual's introduction to iconv
Details about the iconv implementation that is used by PHP are included in the output of the phpinfo
function.
(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)
I think the problem here is that your encodings consider ä and å different symbols to 'a'. In fact, the PHP documentation for strtr offers a sample for removing accents the ugly way :(
What about WordPress implementation?
function remove_accents($string) {
if ( !preg_match('/[\x80-\xff]/', $string) )
return $string;
$chars = array(
// Decompositions for Latin-1 Supplement
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
chr(195).chr(191) => 'y',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's'
);
$string = strtr($string, $chars);
return $string;
}
To better understand what this function does, check this corresponding conversion table here:
À => A
Á => A
 => A
à => A
Ä => A
Å => A
Ç => C
È => E
É => E
Ê => E
Ë => E
Ì => I
Í => I
Î => I
Ï => I
Ñ => N
Ò => O
Ó => O
Ô => O
Õ => O
Ö => O
Ù => U
Ú => U
Û => U
Ü => U
Ý => Y
ß => s
à => a
á => a
â => a
ã => a
ä => a
å => a
ç => c
è => e
é => e
ê => e
ë => e
ì => i
í => i
î => i
ï => i
ñ => n
ò => o
ó => o
ô => o
õ => o
ö => o
ù => u
ú => u
û => u
ü => u
ý => y
ÿ => y
Ā => A
ā => a
Ă => A
ă => a
Ą => A
ą => a
Ć => C
ć => c
Ĉ => C
ĉ => c
Ċ => C
ċ => c
Č => C
č => c
Ď => D
ď => d
Đ => D
đ => d
Ē => E
ē => e
Ĕ => E
ĕ => e
Ė => E
ė => e
Ę => E
ę => e
Ě => E
ě => e
Ĝ => G
ĝ => g
Ğ => G
ğ => g
Ġ => G
ġ => g
Ģ => G
ģ => g
Ĥ => H
ĥ => h
Ħ => H
ħ => h
Ĩ => I
ĩ => i
Ī => I
ī => i
Ĭ => I
ĭ => i
Į => I
į => i
İ => I
ı => i
IJ => IJ
ij => ij
Ĵ => J
ĵ => j
Ķ => K
ķ => k
ĸ => k
Ĺ => L
ĺ => l
Ļ => L
ļ => l
Ľ => L
ľ => l
Ŀ => L
ŀ => l
Ł => L
ł => l
Ń => N
ń => n
Ņ => N
ņ => n
Ň => N
ň => n
ʼn => N
Ŋ => n
ŋ => N
Ō => O
ō => o
Ŏ => O
ŏ => o
Ő => O
ő => o
Œ => OE
œ => oe
Ŕ => R
ŕ => r
Ŗ => R
ŗ => r
Ř => R
ř => r
Ś => S
ś => s
Ŝ => S
ŝ => s
Ş => S
ş => s
Š => S
š => s
Ţ => T
ţ => t
Ť => T
ť => t
Ŧ => T
ŧ => t
Ũ => U
ũ => u
Ū => U
ū => u
Ŭ => U
ŭ => u
Ů => U
ů => u
Ű => U
ű => u
Ų => U
ų => u
Ŵ => W
ŵ => w
Ŷ => Y
ŷ => y
Ÿ => Y
Ź => Z
ź => z
Ż => Z
ż => z
Ž => Z
ž => z
ſ => s
You can generate this convesion table yourself by simply iterarting over the $chars
array of the function:
foreach($chars as $k=>$v) {
printf("%s -> %s", $k, $v);
}
This is a piece of code I found and use often:
function stripAccents($stripAccents){
return strtr($stripAccents,'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
UTF-8 friendly version of the simple function posted above by Gino:
function stripAccents($str) {
return strtr(utf8_decode($str), utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'), 'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
Had to come to this because my php document was UTF-8 encoded.
Hope it helps.
When using iconv
, the parameter locale must be set:
function test_enc($text = 'ěščřžýáíé ĚŠČŘŽÝÁÍÉ fóø bår FÓØ BÅR æ')
{
echo '<tt>';
echo iconv('utf8', 'ascii//TRANSLIT', $text);
echo '</tt><br/>';
}
test_enc();
setlocale(LC_ALL, 'cs_CZ.utf8');
test_enc();
setlocale(LC_ALL, 'en_US.utf8');
test_enc();
Yields into:
????????? ????????? f?? b?r F?? B?R ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
Another locales then cs_CZ and en_US I haven't installed and I can't test it.
In C# I see solution using translation to unicode normalized form - accents are splitted out and then filtered via nonspacing unicode category.
The easiest way is to use iconv()
PHP native function.
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', "Thîs îs à vêry wrong séntènce!");
// output: This is a very wrong sentence!
if you have http://php.net/manual/en/book.intl.php available, this solved your problem
$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);
You could use urlencode. Does not quite do what you want (remove accents), but will give you a url usable string
$output = urlencode ($input);
In Perl I could use a translate regex, but I cannot think of the PHP equivalent
$input =~ tr/áâàå/aaaa/;
etc...
you could do this using preg_replace
$patterns[0] = '/[á|â|à|å|ä]/';
$patterns[1] = '/[ð|é|ê|è|ë]/';
$patterns[2] = '/[í|î|ì|ï]/';
$patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
$patterns[4] = '/[ú|û|ù|ü]/';
$patterns[5] = '/æ/';
$patterns[6] = '/ç/';
$patterns[7] = '/ß/';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';
$output = preg_replace($patterns, $replacements, $input);
(Please note this was typed from a foggy beer ridden Friday after noon memory, so may not be 100% correct)
or you could make a hash table and do a replacement based off of that.
Indeed is a matter of taste. There are many flavors for converting such letters.
function replaceAccents($str)
{
$a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'IJ', 'ij', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ʼn', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
$b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
return str_replace($a, $b, $str);
}
here is a simple function that i use usually to remove accents :
function str_without_accents($str, $charset='utf-8')
{
$str = htmlentities($str, ENT_NOQUOTES, $charset);
$str = preg_replace('#&([A-za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
$str = preg_replace('#&([A-za-z]{2})(?:lig);#', '\1', $str); // pour les ligatures e.g. 'œ'
$str = preg_replace('#&[^;]+;#', '', $str); // supprime les autres caractères
return $str; // or add this : mb_strtoupper($str); for uppercase :)
}
I agree with georgebrock's comment.
If you find a way to get //TRANSLIT to work, you can build friendly URLs:
- use iconv with //TRANSLIT ñ => n~
- remove non-alphanumeric non-whitespace chars inside words:
$url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
- replace remaining separations:
$url = preg_replace( '/[^a-z0-9]+/', '-', $url );
- remove double/leading/traling:
$url = preg_replace( '-'
, e.g.'/(?:(^|\-)\-+|\-$)/', '', $url );
- remove non-alphanumeric non-whitespace chars inside words:
If you can't get it to work, replace setp 1 with strtr/character-based replacement, like Xetius' solution.
I can't reproduce your problem. I get the expected result.
How exactly are you using mb_detect_encoding()
to verify your string is in fact UTF-8?
If I simply call mb_detect_encoding($input)
on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.
iconv()
gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).
I suggest to you try calling mb_check_encoding($input, "utf-8")
first to verify that your string really is UTF-8. I think it probably isn't.
Merged Cazuma Nii Cavalcanti's implementation with Junior Mayhé's char list, hoping to save some time for some of you.
function stripAccents($str) {
return strtr(utf8_decode($str), utf8_decode('ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ'), 'AAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
}
I just created a removeAccents method based on the reading of this thread and this other one too (How to remove accents and turn letters into "plain" ASCII characters?).
The method is here: https://github.com/lingtalfi/Bat/blob/master/StringTool.md#removeaccents
Tests are here: https://github.com/lingtalfi/Bat/blob/master/btests/StringTool/removeAccents/stringTool.removeAccents.test.php,
and here is what was tested so far:
$a = [
// easy
'',
'a',
'après',
'dédé fait la fête ?',
// hard
'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ',
'ŻŹĆŃĄŚŁĘÓżźćńąśłęó',
'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq',
'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ',
'ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ',
'ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķ',
'ĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž',
'ſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǾǿ',
'Ǽǽ',
];
and it converts only accentuated things (letters/ligatures/cédilles/some letters with a line through/...?).
Here is the content of the method: (https://github.com/lingtalfi/Bat/blob/master/StringTool.php#L83)
public static function removeAccents($str)
{
static $map = [
// single letters
'à' => 'a',
'á' => 'a',
'â' => 'a',
'ã' => 'a',
'ä' => 'a',
'ą' => 'a',
'å' => 'a',
'ā' => 'a',
'ă' => 'a',
'ǎ' => 'a',
'ǻ' => 'a',
'À' => 'A',
'Á' => 'A',
'Â' => 'A',
'Ã' => 'A',
'Ä' => 'A',
'Ą' => 'A',
'Å' => 'A',
'Ā' => 'A',
'Ă' => 'A',
'Ǎ' => 'A',
'Ǻ' => 'A',
'ç' => 'c',
'ć' => 'c',
'ĉ' => 'c',
'ċ' => 'c',
'č' => 'c',
'Ç' => 'C',
'Ć' => 'C',
'Ĉ' => 'C',
'Ċ' => 'C',
'Č' => 'C',
'ď' => 'd',
'đ' => 'd',
'Ð' => 'D',
'Ď' => 'D',
'Đ' => 'D',
'è' => 'e',
'é' => 'e',
'ê' => 'e',
'ë' => 'e',
'ę' => 'e',
'ē' => 'e',
'ĕ' => 'e',
'ė' => 'e',
'ě' => 'e',
'È' => 'E',
'É' => 'E',
'Ê' => 'E',
'Ë' => 'E',
'Ę' => 'E',
'Ē' => 'E',
'Ĕ' => 'E',
'Ė' => 'E',
'Ě' => 'E',
'ƒ' => 'f',
'ĝ' => 'g',
'ğ' => 'g',
'ġ' => 'g',
'ģ' => 'g',
'Ĝ' => 'G',
'Ğ' => 'G',
'Ġ' => 'G',
'Ģ' => 'G',
'ĥ' => 'h',
'ħ' => 'h',
'Ĥ' => 'H',
'Ħ' => 'H',
'ì' => 'i',
'í' => 'i',
'î' => 'i',
'ï' => 'i',
'ĩ' => 'i',
'ī' => 'i',
'ĭ' => 'i',
'į' => 'i',
'ſ' => 'i',
'ǐ' => 'i',
'Ì' => 'I',
'Í' => 'I',
'Î' => 'I',
'Ï' => 'I',
'Ĩ' => 'I',
'Ī' => 'I',
'Ĭ' => 'I',
'Į' => 'I',
'İ' => 'I',
'Ǐ' => 'I',
'ĵ' => 'j',
'Ĵ' => 'J',
'ķ' => 'k',
'Ķ' => 'K',
'ł' => 'l',
'ĺ' => 'l',
'ļ' => 'l',
'ľ' => 'l',
'ŀ' => 'l',
'Ł' => 'L',
'Ĺ' => 'L',
'Ļ' => 'L',
'Ľ' => 'L',
'Ŀ' => 'L',
'ñ' => 'n',
'ń' => 'n',
'ņ' => 'n',
'ň' => 'n',
'ʼn' => 'n',
'Ñ' => 'N',
'Ń' => 'N',
'Ņ' => 'N',
'Ň' => 'N',
'ò' => 'o',
'ó' => 'o',
'ô' => 'o',
'õ' => 'o',
'ö' => 'o',
'ð' => 'o',
'ø' => 'o',
'ō' => 'o',
'ŏ' => 'o',
'ő' => 'o',
'ơ' => 'o',
'ǒ' => 'o',
'ǿ' => 'o',
'Ò' => 'O',
'Ó' => 'O',
'Ô' => 'O',
'Õ' => 'O',
'Ö' => 'O',
'Ø' => 'O',
'Ō' => 'O',
'Ŏ' => 'O',
'Ő' => 'O',
'Ơ' => 'O',
'Ǒ' => 'O',
'Ǿ' => 'O',
'ŕ' => 'r',
'ŗ' => 'r',
'ř' => 'r',
'Ŕ' => 'R',
'Ŗ' => 'R',
'Ř' => 'R',
'ś' => 's',
'š' => 's',
'ŝ' => 's',
'ş' => 's',
'Ś' => 'S',
'Š' => 'S',
'Ŝ' => 'S',
'Ş' => 'S',
'ţ' => 't',
'ť' => 't',
'ŧ' => 't',
'Ţ' => 'T',
'Ť' => 'T',
'Ŧ' => 'T',
'ù' => 'u',
'ú' => 'u',
'û' => 'u',
'ü' => 'u',
'ũ' => 'u',
'ū' => 'u',
'ŭ' => 'u',
'ů' => 'u',
'ű' => 'u',
'ų' => 'u',
'ư' => 'u',
'ǔ' => 'u',
'ǖ' => 'u',
'ǘ' => 'u',
'ǚ' => 'u',
'ǜ' => 'u',
'Ù' => 'U',
'Ú' => 'U',
'Û' => 'U',
'Ü' => 'U',
'Ũ' => 'U',
'Ū' => 'U',
'Ŭ' => 'U',
'Ů' => 'U',
'Ű' => 'U',
'Ų' => 'U',
'Ư' => 'U',
'Ǔ' => 'U',
'Ǖ' => 'U',
'Ǘ' => 'U',
'Ǚ' => 'U',
'Ǜ' => 'U',
'ŵ' => 'w',
'Ŵ' => 'W',
'ý' => 'y',
'ÿ' => 'y',
'ŷ' => 'y',
'Ý' => 'Y',
'Ÿ' => 'Y',
'Ŷ' => 'Y',
'ż' => 'z',
'ź' => 'z',
'ž' => 'z',
'Ż' => 'Z',
'Ź' => 'Z',
'Ž' => 'Z',
// accentuated ligatures
'Ǽ' => 'A',
'ǽ' => 'a',
];
return strtr($str, $map);
}
In laravel you can simply use str_slug($accentedPhrase)
and if you care about dash (-) that this method substitute with space you can use str_replace('-', ' ', str_slug($accentedPhrase))
You can use an array key => value style to use with strtr() safely for UTF-8 characters even if they are multi-bytes.
function no_accent($str){
$accents = array('À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'Ā' => 'A', 'ā' => 'a', 'Ă' => 'A', 'ă' => 'a', 'Ą' => 'A', 'ą' => 'a', 'Ç' => 'C', 'ç' => 'c', 'Ć' => 'C', 'ć' => 'c', 'Ĉ' => 'C', 'ĉ' => 'c', 'Ċ' => 'C', 'ċ' => 'c', 'Č' => 'C', 'č' => 'c', 'Ð' => 'D', 'ð' => 'd', 'Ď' => 'D', 'ď' => 'd', 'Đ' => 'D', 'đ' => 'd', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'Ē' => 'E', 'ē' => 'e', 'Ĕ' => 'E', 'ĕ' => 'e', 'Ė' => 'E', 'ė' => 'e', 'Ę' => 'E', 'ę' => 'e', 'Ě' => 'E', 'ě' => 'e', 'Ĝ' => 'G', 'ĝ' => 'g', 'Ğ' => 'G', 'ğ' => 'g', 'Ġ' => 'G', 'ġ' => 'g', 'Ģ' => 'G', 'ģ' => 'g', 'Ĥ' => 'H', 'ĥ' => 'h', 'Ħ' => 'H', 'ħ' => 'h', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'Ĩ' => 'I', 'ĩ' => 'i', 'Ī' => 'I', 'ī' => 'i', 'Ĭ' => 'I', 'ĭ' => 'i', 'Į' => 'I', 'į' => 'i', 'İ' => 'I', 'ı' => 'i', 'Ĵ' => 'J', 'ĵ' => 'j', 'Ķ' => 'K', 'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'L', 'ĺ' => 'l', 'Ļ' => 'L', 'ļ' => 'l', 'Ľ' => 'L', 'ľ' => 'l', 'Ŀ' => 'L', 'ŀ' => 'l', 'Ł' => 'L', 'ł' => 'l', 'Ñ' => 'N', 'ñ' => 'n', 'Ń' => 'N', 'ń' => 'n', 'Ņ' => 'N', 'ņ' => 'n', 'Ň' => 'N', 'ň' => 'n', 'ʼn' => 'n', 'Ŋ' => 'N', 'ŋ' => 'n', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'Ō' => 'O', 'ō' => 'o', 'Ŏ' => 'O', 'ŏ' => 'o', 'Ő' => 'O', 'ő' => 'o', 'Ŕ' => 'R', 'ŕ' => 'r', 'Ŗ' => 'R', 'ŗ' => 'r', 'Ř' => 'R', 'ř' => 'r', 'Ś' => 'S', 'ś' => 's', 'Ŝ' => 'S', 'ŝ' => 's', 'Ş' => 'S', 'ş' => 's', 'Š' => 'S', 'š' => 's', 'ſ' => 's', 'Ţ' => 'T', 'ţ' => 't', 'Ť' => 'T', 'ť' => 't', 'Ŧ' => 'T', 'ŧ' => 't', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'Ũ' => 'U', 'ũ' => 'u', 'Ū' => 'U', 'ū' => 'u', 'Ŭ' => 'U', 'ŭ' => 'u', 'Ů' => 'U', 'ů' => 'u', 'Ű' => 'U', 'ű' => 'u', 'Ų' => 'U', 'ų' => 'u', 'Ŵ' => 'W', 'ŵ' => 'w', 'Ý' => 'Y', 'ý' => 'y', 'ÿ' => 'y', 'Ŷ' => 'Y', 'ŷ' => 'y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'ź' => 'z', 'Ż' => 'Z', 'ż' => 'z', 'Ž' => 'Z', 'ž' => 'z');
return strtr($str, $accents);
}
Plus, you save decode/encode in UTF-8 part.
$unwanted_array = array( '&' => 'and', '&' => 'and', '@' => 'at', '©' => 'c', '®' => 'r',
'̊'=>'','̧'=>'','̨'=>'','̄'=>'','̱'=>'',
'Á'=>'a','á'=>'a','À'=>'a','à'=>'a','Ă'=>'a','ă'=>'a','ắ'=>'a','Ắ'=>'A','Ằ'=>'A',
'ằ'=>'a','ẵ'=>'a','Ẵ'=>'A','ẳ'=>'a','Ẳ'=>'A','Â'=>'a','â'=>'a','ấ'=>'a','Ấ'=>'A',
'ầ'=>'a','Ầ'=>'a','ẩ'=>'a','Ẩ'=>'A','Ǎ'=>'a','ǎ'=>'a','Å'=>'a','å'=>'a','Ǻ'=>'a',
'ǻ'=>'a','Ä'=>'a','ä'=>'a','ã'=>'a','Ã'=>'A','Ą'=>'a','ą'=>'a','Ā'=>'a','ā'=>'a',
'ả'=>'a','Ả'=>'a','Ạ'=>'A','ạ'=>'a','ặ'=>'a','Ặ'=>'A','ậ'=>'a','Ậ'=>'A','Æ'=>'ae',
'æ'=>'ae','Ǽ'=>'ae','ǽ'=>'ae','ẫ'=>'a','Ẫ'=>'A',
'Ć'=>'c','ć'=>'c','Ĉ'=>'c','ĉ'=>'c','Č'=>'c','č'=>'c','Ċ'=>'c','ċ'=>'c','Ç'=>'c','ç'=>'c',
'Ď'=>'d','ď'=>'d','Ḑ'=>'D','ḑ'=>'d','Đ'=>'d','đ'=>'d','Ḍ'=>'D','ḍ'=>'d','Ḏ'=>'D','ḏ'=>'d','ð'=>'d','Ð'=>'D',
'É'=>'e','é'=>'e','È'=>'e','è'=>'e','Ĕ'=>'e','ĕ'=>'e','ê'=>'e','ế'=>'e','Ế'=>'E','ề'=>'e',
'Ề'=>'E','Ě'=>'e','ě'=>'e','Ë'=>'e','ë'=>'e','Ė'=>'e','ė'=>'e','Ę'=>'e','ę'=>'e','Ē'=>'e',
'ē'=>'e','ệ'=>'e','Ệ'=>'E','Ə'=>'e','ə'=>'e','ẽ'=>'e','Ẽ'=>'E','ễ'=>'e',
'Ễ'=>'E','ể'=>'e','Ể'=>'E','ẻ'=>'e','Ẻ'=>'E','ẹ'=>'e','Ẹ'=>'E',
'ƒ'=>'f',
'Ğ'=>'g','ğ'=>'g','Ĝ'=>'g','ĝ'=>'g','Ǧ'=>'G','ǧ'=>'g','Ġ'=>'g','ġ'=>'g','Ģ'=>'g','ģ'=>'g',
'H̲'=>'H','h̲'=>'h','Ĥ'=>'h','ĥ'=>'h','Ȟ'=>'H','ȟ'=>'h','Ḩ'=>'H','ḩ'=>'h','Ħ'=>'h','ħ'=>'h','Ḥ'=>'H','ḥ'=>'h',
'Ỉ'=>'I','Í'=>'i','í'=>'i','Ì'=>'i','ì'=>'i','Ĭ'=>'i','ĭ'=>'i','Î'=>'i','î'=>'i','Ǐ'=>'i','ǐ'=>'i',
'Ï'=>'i','ï'=>'i','Ḯ'=>'I','ḯ'=>'i','Ĩ'=>'i','ĩ'=>'i','İ'=>'i','Į'=>'i','į'=>'i','Ī'=>'i','ī'=>'i',
'ỉ'=>'I','Ị'=>'I','ị'=>'i','IJ'=>'ij','ij'=>'ij','ı'=>'i',
'Ĵ'=>'j','ĵ'=>'j',
'Ķ'=>'k','ķ'=>'k','Ḵ'=>'K','ḵ'=>'k',
'Ĺ'=>'l','ĺ'=>'l','Ľ'=>'l','ľ'=>'l','Ļ'=>'l','ļ'=>'l','Ł'=>'l','ł'=>'l','Ŀ'=>'l','ŀ'=>'l',
'Ń'=>'n','ń'=>'n','Ň'=>'n','ň'=>'n','Ñ'=>'N','ñ'=>'n','Ņ'=>'n','ņ'=>'n','Ṇ'=>'N','ṇ'=>'n','Ŋ'=>'n','ŋ'=>'n',
'Ó'=>'o','ó'=>'o','Ò'=>'o','ò'=>'o','Ŏ'=>'o','ŏ'=>'o','Ô'=>'o','ô'=>'o','ố'=>'o','Ố'=>'O','ồ'=>'o',
'Ồ'=>'O','ổ'=>'o','Ổ'=>'O','Ǒ'=>'o','ǒ'=>'o','Ö'=>'o','ö'=>'o','Ő'=>'o','ő'=>'o','Õ'=>'o','õ'=>'o',
'Ø'=>'o','ø'=>'o','Ǿ'=>'o','ǿ'=>'o','Ǫ'=>'O','ǫ'=>'o','Ǭ'=>'O','ǭ'=>'o','Ō'=>'o','ō'=>'o','ỏ'=>'o',
'Ỏ'=>'O','Ơ'=>'o','ơ'=>'o','ớ'=>'o','Ớ'=>'O','ờ'=>'o','Ờ'=>'O','ở'=>'o','Ở'=>'O','ợ'=>'o','Ợ'=>'O',
'ọ'=>'o','Ọ'=>'O','ọ'=>'o','Ọ'=>'O','ộ'=>'o','Ộ'=>'O','ỗ'=>'o','Ỗ'=>'O','ỡ'=>'o','Ỡ'=>'O',
'Œ'=>'oe','œ'=>'oe',
'ĸ'=>'k',
'Ŕ'=>'r','ŕ'=>'r','Ř'=>'r','ř'=>'r','ṙ'=>'r','Ŗ'=>'r','ŗ'=>'r','Ṛ'=>'R','ṛ'=>'r','Ṟ'=>'R','ṟ'=>'r',
'S̲'=>'S','s̲'=>'s','Ś'=>'s','ś'=>'s','Ŝ'=>'s','ŝ'=>'s','Š'=>'s','š'=>'s','Ş'=>'s','ş'=>'s',
'Ṣ'=>'S','ṣ'=>'s','Ș'=>'S','ș'=>'s',
'ſ'=>'z','ß'=>'ss','Ť'=>'t','ť'=>'t','Ţ'=>'t','ţ'=>'t','Ṭ'=>'T','ṭ'=>'t','Ț'=>'T',
'ț'=>'t','Ṯ'=>'T','ṯ'=>'t','™'=>'tm','Ŧ'=>'t','ŧ'=>'t',
'Ú'=>'u','ú'=>'u','Ù'=>'u','ù'=>'u','Ŭ'=>'u','ŭ'=>'u','Û'=>'u','û'=>'u','Ǔ'=>'u','ǔ'=>'u','Ů'=>'u','ů'=>'u',
'Ü'=>'u','ü'=>'u','Ǘ'=>'u','ǘ'=>'u','Ǜ'=>'u','ǜ'=>'u','Ǚ'=>'u','ǚ'=>'u','Ǖ'=>'u','ǖ'=>'u','Ű'=>'u','ű'=>'u',
'Ũ'=>'u','ũ'=>'u','Ų'=>'u','ų'=>'u','Ū'=>'u','ū'=>'u','Ư'=>'u','ư'=>'u','ứ'=>'u','Ứ'=>'U','ừ'=>'u','Ừ'=>'U',
'ử'=>'u','Ử'=>'U','ự'=>'u','Ự'=>'U','ụ'=>'u','Ụ'=>'U','ủ'=>'u','Ủ'=>'U','ữ'=>'u','Ữ'=>'U',
'Ŵ'=>'w','ŵ'=>'w',
'Ý'=>'y','ý'=>'y','ỳ'=>'y','Ỳ'=>'Y','Ŷ'=>'y','ŷ'=>'y','ÿ'=>'y','Ÿ'=>'y','ỹ'=>'y','Ỹ'=>'Y','ỷ'=>'y','Ỷ'=>'Y',
'Z̲'=>'Z','z̲'=>'z','Ź'=>'z','ź'=>'z','Ž'=>'z','ž'=>'z','Ż'=>'z','ż'=>'z','Ẕ'=>'Z','ẕ'=>'z',
'þ'=>'p','ʼn'=>'n','А'=>'a','а'=>'a','Б'=>'b','б'=>'b','В'=>'v','в'=>'v','Г'=>'g','г'=>'g','Ґ'=>'g','ґ'=>'g',
'Д'=>'d','д'=>'d','Е'=>'e','е'=>'e','Ё'=>'jo','ё'=>'jo','Є'=>'e','є'=>'e','Ж'=>'zh','ж'=>'zh','З'=>'z','з'=>'z',
'И'=>'i','и'=>'i','І'=>'i','і'=>'i','Ї'=>'i','ї'=>'i','Й'=>'j','й'=>'j','К'=>'k','к'=>'k','Л'=>'l','л'=>'l',
'М'=>'m','м'=>'m','Н'=>'n','н'=>'n','О'=>'o','о'=>'o','П'=>'p','п'=>'p','Р'=>'r','р'=>'r','С'=>'s','с'=>'s',
'Т'=>'t','т'=>'t','У'=>'u','у'=>'u','Ф'=>'f','ф'=>'f','Х'=>'h','х'=>'h','Ц'=>'c','ц'=>'c','Ч'=>'ch','ч'=>'ch',
'Ш'=>'sh','ш'=>'sh','Щ'=>'sch','щ'=>'sch','Ъ'=>'-',
'ъ'=>'-','Ы'=>'y','ы'=>'y','Ь'=>'-','ь'=>'-',
'Э'=>'je','э'=>'je','Ю'=>'ju','ю'=>'ju','Я'=>'ja','я'=>'ja','א'=>'a','ב'=>'b','ג'=>'g','ד'=>'d','ה'=>'h','ו'=>'v',
'ז'=>'z','ח'=>'h','ט'=>'t','י'=>'i','ך'=>'k','כ'=>'k','ל'=>'l','ם'=>'m','מ'=>'m','ן'=>'n','נ'=>'n','ס'=>'s','ע'=>'e',
'ף'=>'p','פ'=>'p','ץ'=>'C','צ'=>'c','ק'=>'q','ר'=>'r','ש'=>'w','ת'=>'t'
);
$accentsRemoved = strtr( $stringToRemoveAccents , $unwanted_array );
An improved version of remove_accents()
function according to last version Wordpress 4.3 formatting is:
function mbstring_binary_safe_encoding( $reset = false ) {
static $encodings = array();
static $overloaded = null;
if ( is_null( $overloaded ) )
$overloaded = function_exists( 'mb_internal_encoding' ) && ( ini_get( 'mbstring.func_overload' ) & 2 );
if ( false === $overloaded )
return;
if ( ! $reset ) {
$encoding = mb_internal_encoding();
array_push( $encodings, $encoding );
mb_internal_encoding( 'ISO-8859-1' );
}
if ( $reset && $encodings ) {
$encoding = array_pop( $encodings );
mb_internal_encoding( $encoding );
}
}
function reset_mbstring_encoding() {
mbstring_binary_safe_encoding( true );
}
function seems_utf8( $str ) {
mbstring_binary_safe_encoding();
$length = strlen($str);
reset_mbstring_encoding();
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; // 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; // 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; // 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; // 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; // 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; // 1111110b
else return false; // Does not match any model
for ($j=0; $j<$n; $j++) { // n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
function remove_accents( $string ) {
if ( !preg_match('/[\x80-\xff]/', $string) )
return $string;
if (seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
chr(194).chr(170) => 'a', chr(194).chr(186) => 'o',
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(134) => 'AE',chr(195).chr(135) => 'C',
chr(195).chr(136) => 'E', chr(195).chr(137) => 'E',
chr(195).chr(138) => 'E', chr(195).chr(139) => 'E',
chr(195).chr(140) => 'I', chr(195).chr(141) => 'I',
chr(195).chr(142) => 'I', chr(195).chr(143) => 'I',
chr(195).chr(144) => 'D', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(158) => 'TH',chr(195).chr(159) => 's',
chr(195).chr(160) => 'a', chr(195).chr(161) => 'a',
chr(195).chr(162) => 'a', chr(195).chr(163) => 'a',
chr(195).chr(164) => 'a', chr(195).chr(165) => 'a',
chr(195).chr(166) => 'ae',chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(176) => 'd', chr(195).chr(177) => 'n',
chr(195).chr(178) => 'o', chr(195).chr(179) => 'o',
chr(195).chr(180) => 'o', chr(195).chr(181) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(184) => 'o',
chr(195).chr(185) => 'u', chr(195).chr(186) => 'u',
chr(195).chr(187) => 'u', chr(195).chr(188) => 'u',
chr(195).chr(189) => 'y', chr(195).chr(190) => 'th',
chr(195).chr(191) => 'y', chr(195).chr(152) => 'O',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
// Decompositions for Latin Extended-B
chr(200).chr(152) => 'S', chr(200).chr(153) => 's',
chr(200).chr(154) => 'T', chr(200).chr(155) => 't',
// Euro Sign
chr(226).chr(130).chr(172) => 'E',
// GBP (Pound) Sign
chr(194).chr(163) => '',
// Vowels with diacritic (Vietnamese)
// unmarked
chr(198).chr(160) => 'O', chr(198).chr(161) => 'o',
chr(198).chr(175) => 'U', chr(198).chr(176) => 'u',
// grave accent
chr(225).chr(186).chr(166) => 'A', chr(225).chr(186).chr(167) => 'a',
chr(225).chr(186).chr(176) => 'A', chr(225).chr(186).chr(177) => 'a',
chr(225).chr(187).chr(128) => 'E', chr(225).chr(187).chr(129) => 'e',
chr(225).chr(187).chr(146) => 'O', chr(225).chr(187).chr(147) => 'o',
chr(225).chr(187).chr(156) => 'O', chr(225).chr(187).chr(157) => 'o',
chr(225).chr(187).chr(170) => 'U', chr(225).chr(187).chr(171) => 'u',
chr(225).chr(187).chr(178) => 'Y', chr(225).chr(187).chr(179) => 'y',
// hook
chr(225).chr(186).chr(162) => 'A', chr(225).chr(186).chr(163) => 'a',
chr(225).chr(186).chr(168) => 'A', chr(225).chr(186).chr(169) => 'a',
chr(225).chr(186).chr(178) => 'A', chr(225).chr(186).chr(179) => 'a',
chr(225).chr(186).chr(186) => 'E', chr(225).chr(186).chr(187) => 'e',
chr(225).chr(187).chr(130) => 'E', chr(225).chr(187).chr(131) => 'e',
chr(225).chr(187).chr(136) => 'I', chr(225).chr(187).chr(137) => 'i',
chr(225).chr(187).chr(142) => 'O', chr(225).chr(187).chr(143) => 'o',
chr(225).chr(187).chr(148) => 'O', chr(225).chr(187).chr(149) => 'o',
chr(225).chr(187).chr(158) => 'O', chr(225).chr(187).chr(159) => 'o',
chr(225).chr(187).chr(166) => 'U', chr(225).chr(187).chr(167) => 'u',
chr(225).chr(187).chr(172) => 'U', chr(225).chr(187).chr(173) => 'u',
chr(225).chr(187).chr(182) => 'Y', chr(225).chr(187).chr(183) => 'y',
// tilde
chr(225).chr(186).chr(170) => 'A', chr(225).chr(186).chr(171) => 'a',
chr(225).chr(186).chr(180) => 'A', chr(225).chr(186).chr(181) => 'a',
chr(225).chr(186).chr(188) => 'E', chr(225).chr(186).chr(189) => 'e',
chr(225).chr(187).chr(132) => 'E', chr(225).chr(187).chr(133) => 'e',
chr(225).chr(187).chr(150) => 'O', chr(225).chr(187).chr(151) => 'o',
chr(225).chr(187).chr(160) => 'O', chr(225).chr(187).chr(161) => 'o',
chr(225).chr(187).chr(174) => 'U', chr(225).chr(187).chr(175) => 'u',
chr(225).chr(187).chr(184) => 'Y', chr(225).chr(187).chr(185) => 'y',
// acute accent
chr(225).chr(186).chr(164) => 'A', chr(225).chr(186).chr(165) => 'a',
chr(225).chr(186).chr(174) => 'A', chr(225).chr(186).chr(175) => 'a',
chr(225).chr(186).chr(190) => 'E', chr(225).chr(186).chr(191) => 'e',
chr(225).chr(187).chr(144) => 'O', chr(225).chr(187).chr(145) => 'o',
chr(225).chr(187).chr(154) => 'O', chr(225).chr(187).chr(155) => 'o',
chr(225).chr(187).chr(168) => 'U', chr(225).chr(187).chr(169) => 'u',
// dot below
chr(225).chr(186).chr(160) => 'A', chr(225).chr(186).chr(161) => 'a',
chr(225).chr(186).chr(172) => 'A', chr(225).chr(186).chr(173) => 'a',
chr(225).chr(186).chr(182) => 'A', chr(225).chr(186).chr(183) => 'a',
chr(225).chr(186).chr(184) => 'E', chr(225).chr(186).chr(185) => 'e',
chr(225).chr(187).chr(134) => 'E', chr(225).chr(187).chr(135) => 'e',
chr(225).chr(187).chr(138) => 'I', chr(225).chr(187).chr(139) => 'i',
chr(225).chr(187).chr(140) => 'O', chr(225).chr(187).chr(141) => 'o',
chr(225).chr(187).chr(152) => 'O', chr(225).chr(187).chr(153) => 'o',
chr(225).chr(187).chr(162) => 'O', chr(225).chr(187).chr(163) => 'o',
chr(225).chr(187).chr(164) => 'U', chr(225).chr(187).chr(165) => 'u',
chr(225).chr(187).chr(176) => 'U', chr(225).chr(187).chr(177) => 'u',
chr(225).chr(187).chr(180) => 'Y', chr(225).chr(187).chr(181) => 'y',
// Vowels with diacritic (Chinese, Hanyu Pinyin)
chr(201).chr(145) => 'a',
// macron
chr(199).chr(149) => 'U', chr(199).chr(150) => 'u',
// acute accent
chr(199).chr(151) => 'U', chr(199).chr(152) => 'u',
// caron
chr(199).chr(141) => 'A', chr(199).chr(142) => 'a',
chr(199).chr(143) => 'I', chr(199).chr(144) => 'i',
chr(199).chr(145) => 'O', chr(199).chr(146) => 'o',
chr(199).chr(147) => 'U', chr(199).chr(148) => 'u',
chr(199).chr(153) => 'U', chr(199).chr(154) => 'u',
// grave accent
chr(199).chr(155) => 'U', chr(199).chr(156) => 'u',
);
$string = strtr($string, $chars);
} else {
$chars = array();
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
.chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
.chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
.chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
.chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
.chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
.chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
.chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
.chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
.chr(252).chr(253).chr(255);
$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
$string = strtr($string, $chars['in'], $chars['out']);
$double_chars = array();
$double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}
return $string;
}
My answer is an update of @dynamic solution since Romanian or perhaps other language diacritics weren't converted. I wrote the minimum functions and works like a charm.
print_r(remove_accents('Iași, Iași County, Romania'));
<?php
/*
* Thanks:
* - The idea of extracting accents equiv chars with the help of the HTMLSpecialChars convertion was taking from ICanBoogie Package of 'Olivier Laviale' {@link http://www.weirdog.com/blog/php/supprimer-les-accents-des-caracteres-accentues.html}
*/
function accentCharsModifier($str){
if(($length=mb_strlen($str,"UTF-8"))<strlen($str)){
$i=$count=0;
while($i<$length){
if(strlen($c=mb_substr($str,$i,1,"UTF-8"))>1){
$he=htmlentities($c);
if(($nC=preg_replace("#&([A-Za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#", "\\1", $he))!=$he ||
($nC=preg_replace("#&([A-Za-z]{2})(?:lig);#", "\\1", $he))!=$he ||
($nC=preg_replace("#&[^;]+;#", "", $he))!=$he){
$str=str_replace($c,$nC,$str,$count);if($nC==""){$length=$length-$count;$i--;}
}
}
$i++;
}
}
return $str;
}
echo accentCharsModifier("&éôpkAÈû");
?>
Based on @Mimouni answer I made this function to transliterate Accented strings to Non Accented strings.
/**
* @param $str Convert string to lowercase and replace special chars to equivalents ou remove its
* @return string
*/
function _slugify(string $string): string
{
$str = $string; // for comparisons
$str = _toUtf8($str); // Force to work with string in UTF-8
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
if ($str != htmlentities($string, ENT_QUOTES, 'UTF-8')) { // iconv fails
$str = _toUtf8($string);
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
$str = preg_replace('#&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);#i', '$1', $str);
// Need to strip non ASCII chars or any other than a-z, A-Z, 0-9...
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = preg_replace(array('#[^0-9a-z]#i', '#[ -]+#'), ' ', $str);
$str = trim($str, ' -');
}
// lowercase
$string = strtolower($str);
return $string;
}
To convert strings to UTF-8, here I use the Multi Byte String extension. Note that I break string in pieces to avoid trouble with mixed content (I have such situation) and convert word by word.
/**
* @param $str string String in any encoding
* @return string
*/
function _toUtf8(string $str_in): ?string
{
if (!function_exists('mb_detect_encoding')) {
throw new \Exception('The Multi Byte String extension is absent!');
}
$str_out = [];
$words = explode(" ", $str_in);
foreach ($words as $word) {
$current_encoding = mb_detect_encoding($word, 'UTF-8, ASCII, ISO-8859-1');
$str_out[] = mb_convert_encoding($word, 'UTF-8', $current_encoding);
}
return implode(" ", $str_out);
}
Footer Notes: Was the only solution that pass in PHPUnit UnitTests in Windows command Line (locale issues) The @gabo solution should work but unfortunately not for me
One of the tricks I stumbled upon on the web was using htmlentities then stripping the encoded character :
$stripped = preg_replace('`&[^;]+;`','',htmlentities($string));
Not perfect but it does work well in some case.
But, you're writing about creating an URL string, so urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use this last function : http_build_query.
WordPress' implementation is definitly the safest for UTF8 strings. For Latin1 strings, a simple strtr does the job, but ensure you're saving your script in LATIN1 format, not UTF-8.
来源:https://stackoverflow.com/questions/1017599/how-do-i-remove-accents-from-characters-in-a-php-string