Remove non-utf8 characters from string

心在旅途 2020-11-22 11:56

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

18 answers
  •  遇见更好的自我
    2020-11-22 12:38

    Welcome to 2019 and the /u modifier in regex, which makes preg_* functions treat the pattern and the subject as UTF-8 and so handles multibyte chars for you.

    If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.

    This method will:

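    • Get rid of all invalid UTF-8 byte sequences with mb_convert_encoding (by default each one is replaced with the substitute character ?, and it is dropped entirely only if mb_substitute_character('none') has been set)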
    • Remove all non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace
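
    To illustrate the first step on its own, here is a minimal sketch using the bytes from the question. It assumes the mbstring extension is loaded. The call to mb_substitute_character('none') is an optional setting that is not part of the method in this answer; it is shown only to contrast replacing with dropping, and the variable names are made up for the example.

    ```php
    <?php
    // Bytes from the question: 0x97 is a stray continuation byte, so this is not valid UTF-8.
    $raw = "\x97\x61\x6C\x6F";

    // Replace mode: the invalid byte becomes the substitute character.
    // 0x3F ("?") is the usual default; it is set explicitly here to keep the sketch deterministic.
    mb_substitute_character(0x3F);
    $replaced = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

    // Drop mode: invalid sequences are removed instead of replaced.
    mb_substitute_character('none');
    $dropped = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

    var_dump(mb_check_encoding($raw, 'UTF-8')); // bool(false)
    var_dump($replaced);                        // string(4) "?alo"
    var_dump($dropped);                         // string(3) "alo"
    ```

    Whether a visible ? or a silent drop is preferable depends on whether you want damaged input to stay noticeable.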

    method:

    function utf8_filter(string $value): string{
        return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
    }
    

    [:print:] matches all printable chars; the pattern also keeps \n newlines and strips everything else.

    In the ASCII table, the printable chars range from 32 to 126. The control chars range from 0 to 31, plus 127 (DEL). Newline \n is one of the control chars, so we have to add it to the regex explicitly: /[^[:print:]\n]/u
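
    As a quick check of which ASCII characters survive, the following sketch runs every byte from 0 to 127 through the same pattern. It only uses the regex from this answer; the variable names are made up for the example.

    ```php
    <?php
    // Build a string containing every ASCII character, 0x00 through 0x7F.
    $ascii = implode('', array_map('chr', range(0, 127)));

    // Same pattern as in utf8_filter(): keep printable chars and \n, strip the rest.
    $kept = preg_replace('/[^[:print:]\n]/u', '', $ascii);

    // 95 printable chars (32 to 126) plus the newline are kept.
    echo strlen($kept), "\n"; // 96
    ```

    The newline is the only control char left, and DEL (\x7F) is stripped along with the others.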

    You can try to send strings through the regex with chars outside the printable range like \x7F (DEL), \x1B (Esc) etc. and see how they are stripped

    function utf8_filter(string $value): string{
        return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
    }
    
    $arr = [
        'Danish chars'          => 'Hello from Denmark with æøå',
        'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
    ];
    
    foreach($arr as $k => $v){
        echo "$k:\n---------\n";
        
        $len = strlen($v);
        echo "$v\n(".$len.")\n";
        
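        // Pass the string straight to the filter. Wrapping it in utf8_encode()/utf8_decode()
        // treats the input as ISO-8859-1 and can corrupt text that is already UTF-8
        // (both functions are also deprecated as of PHP 8.2).
        $strip = utf8_filter($v);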
        $strip_len = strlen($strip);
        echo $strip."\n(".$strip_len.")\n\n";
        
        echo "Chars removed: ".($len - $strip_len)."\n\n\n";
    }
    

    https://www.tehplayground.com/q5sJ3FOddhv1atpR
