I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).
Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.
If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.
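A quick way to see this, as a minimal sketch: control bytes such as NUL and Esc are valid UTF-8 on their own, so the conversion has nothing to fix. The sample string is made up for illustration.

```php
<?php
// Hypothetical sample: valid UTF-8 that contains two control chars (NUL and Esc).
$raw = "abc\x00\x1Bdef";

// Converting UTF-8 to UTF-8 only deals with invalid byte sequences.
$converted = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

var_dump($converted === $raw);               // the control chars are still there
var_dump(strlen($raw), strlen($converted));  // same byte length before and after
```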
This method will:
- Handle invalid UTF-8 multibyte sequences with mb_convert_encoding
- Remove non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace
function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
[:print:] matches all printable chars. The negated class keeps those plus \n newlines and strips everything else.
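As a rough sketch, here are both steps applied to the bytes from the question (0x97 is a lone continuation byte, so the string is not valid UTF-8). The function is the one from above; the variable names are just for illustration. What the invalid byte turns into depends on the mbstring substitute character setting, so the sketch only checks validity and the surviving tail rather than assuming a particular output.

```php
<?php
function utf8_filter(string $value): string {
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

// Bytes from the question, plus a trailing NUL to exercise the second step.
$input = "\x97\x61\x6C\x6F\x00";

var_dump(mb_check_encoding($input, 'UTF-8'));   // bool(false): 0x97 cannot stand alone

$clean = utf8_filter($input);

var_dump(mb_check_encoding($clean, 'UTF-8'));   // bool(true): result is valid UTF-8
var_dump(substr($clean, -3));                   // string(3) "alo": the NUL is gone
```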
In the ASCII table, the printable chars range from 32 to 126, while newline \n is one of the control chars, which range from 0 to 31 (127, DEL, is a control char too). So we have to add newline to the regex explicitly: /[^[:print:]\n]/u
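To check those ranges without a table, one option is to loop over the 128 ASCII codes and record which ones the character class accepts. This is only a sketch; $kept and $code are illustrative names.

```php
<?php
// Collect every ASCII code that the class [[:print:]\n] accepts.
$kept = [];
for ($code = 0; $code < 128; $code++) {
    // \A and \z anchor the match to the whole one-char string.
    if (preg_match('/\A[[:print:]\n]\z/u', chr($code)) === 1) {
        $kept[] = $code;
    }
}

echo implode(' ', $kept), "\n";   // expect 10 (newline) followed by 32 through 126
```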
You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped:
function utf8_filter(string $value): string{
    // Sanitise invalid UTF-8 first, then strip everything that is not printable or "\n"
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

$arr = [
    'Danish chars' => 'Hello from Denmark with æøå',
    'Non-printable chars' => "\x7FHello with invalid chars\r \x00"
];

foreach($arr as $k => $v){
    echo "$k:\n---------\n";
    $len = strlen($v); // strlen() counts bytes, not characters
    echo "$v\n(".$len.")\n";
    // Assuming this file is saved as UTF-8, the literals can be filtered directly
    // (utf8_encode()/utf8_decode() are deprecated as of PHP 8.2)
    $strip = utf8_filter($v);
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";
    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}
https://www.tehplayground.com/q5sJ3FOddhv1atpR