I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).
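To make the problem concrete, here is a minimal sketch that checks those four bytes. The guess that the data came from Windows-1252 is only an assumption for illustration; the question does not say where the bytes originated.

```php
<?php
// The bytes from the question: 0x97 followed by ASCII "alo".
$raw = "\x97\x61\x6C\x6F";

// 0x97 is a UTF-8 continuation byte with no lead byte before it,
// so the string is not valid UTF-8.
$isValid = mb_check_encoding($raw, 'UTF-8');
var_dump($isValid); // bool(false)

// Assumption: the source was Windows-1252, where 0x97 is the em dash (U+2014).
$converted = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
echo bin2hex($converted), "\n"; // e28094616c6f
```

If the source encoding is known, converting is usually preferable to stripping, because no characters are lost.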
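To remove all Unicode characters outside of the Unicode Basic Multilingual Plane (this needs the /u modifier and \x{...} escapes, because a bare \xFFFF is read as \xFF followed by two literal F characters):
$str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);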
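A small sketch of BMP filtering with the /u modifier and \x{...} code point escapes. The sample strings are made up for illustration. Note that with /u, preg_replace() returns null when the input is not valid UTF-8, so this only helps once the string is already well-formed.

```php
<?php
// "a", U+1F600 (an emoji outside the BMP), "b". The "\u{...}" syntax needs PHP 7+.
$input = "a\u{1F600}b";

// Single quotes keep \x{...} intact so that PCRE, not PHP, interprets it.
$bmpOnly = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $input);
echo $bmpOnly, "\n"; // ab

// Invalid UTF-8 input makes a /u pattern fail instead of filtering.
$failed = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', "\x97alo");
var_dump($failed); // NULL
```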
If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
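The garbling can be reproduced in a few lines. This sketch uses mb_convert_encoding() from ISO-8859-1 in place of utf8_encode(), since utf8_encode() is deprecated as of PHP 8.2; both treat every input byte as one Latin-1 character.

```php
<?php
// "é" already encoded as UTF-8 (two bytes).
$utf8 = "\xC3\xA9";

// Same effect as utf8_encode($utf8): each byte is treated as a Latin-1 character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($double), "\n"; // c383c2a9, which displays as "Ã©"
```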
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Download:
https://github.com/neitanod/forceutf8
Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.
If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8') you will still end up with non-printable chars in your string.
This method will clean up invalid UTF-8 multibyte sequences with mb_convert_encoding, then remove non-printable chars such as \r, \x00 (NULL-byte) and other control chars with preg_replace:
function utf8_filter(string $value): string{
return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
[:print:] matches all printable chars; the pattern keeps those plus \n (newlines) and strips everything else.
In the ASCII table, the printable chars range from 32 to 126, but newline \n is one of the control chars, which range from 0 to 31 (plus 127, DEL), so we have to add newline to the regex: /[^[:print:]\n]/u
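One way to check that range yourself is to run every ASCII code point through the same character class and list the survivors. This is only a quick sketch; the variable names are made up for illustration.

```php
<?php
$kept = [];
for ($code = 0; $code <= 127; $code++) {
    // Keep the code point if the filter pattern does not strip it.
    if (preg_replace('/[^[:print:]\n]/u', '', chr($code)) !== '') {
        $kept[] = $code;
    }
}

// Expected on a typical PHP build: 10 (newline) followed by 32 through 126.
echo implode(',', $kept), "\n";
```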
You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped:
function utf8_filter(string $value): string{
return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
$arr = [
'Danish chars' => 'Hello from Denmark with æøå',
'Non-printable chars' => "\x7FHello with invalid chars\r \x00"
];
foreach($arr as $k => $v){
echo "$k:\n---------\n";
$len = strlen($v);
echo "$v\n(".$len.")\n";
$strip = utf8_filter($v); // $v is already UTF-8, so no utf8_encode()/utf8_decode() round trip is needed
$strip_len = strlen($strip);
echo $strip."\n(".$strip_len.")\n\n";
echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}
https://www.tehplayground.com/q5sJ3FOddhv1atpR
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/
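A hedged sketch of how this could be wrapped up. The function name strip_invalid_utf8 is hypothetical. On some platforms iconv() with //IGNORE has been reported to return false and raise a notice on invalid input instead of skipping the bytes, so the sketch falls back to mb_convert_encoding() in that case (which replaces invalid bytes with "?" by default rather than dropping them).

```php
<?php
// Hypothetical helper, not part of any library.
function strip_invalid_utf8(string $text): string
{
    // @ suppresses the notice some iconv implementations emit on invalid bytes.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $text);

    if ($clean === false) {
        // Fallback: always returns valid UTF-8, substituting invalid bytes.
        $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    }

    return $clean;
}

$clean = strip_invalid_utf8("a\x97b");
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```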
The text may contain non-UTF-8 characters. Try doing this first:
$nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');
You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php
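Keep in mind that by default mb_convert_encoding() replaces each invalid sequence with "?" rather than removing it. If you want the bytes dropped, a sketch like the following should work, using mb_substitute_character('none'); the sample string is made up for illustration.

```php
<?php
$dirty = "a\x97b"; // 0x97 is not valid UTF-8 on its own

// Default behaviour: the invalid byte is replaced by the substitute character.
$replaced = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

// Temporarily tell mbstring to drop invalid sequences instead.
$previous = mb_substitute_character();
mb_substitute_character('none');
$dropped = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');
mb_substitute_character($previous); // restore the earlier setting

echo $replaced, "\n"; // a?b
echo $dropped, "\n";  // ab
```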
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));