Remove non-utf8 characters from string

心在旅途 2020-11-22 11:56

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

18 Answers
  • 2020-11-22 12:37

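    To remove all Unicode characters outside the Basic Multilingual Plane (code points above U+FFFF):

    $str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);

    The /u modifier is needed so that \x{FFFF} is read as a code point rather than as the byte \xFF followed by the literal "FF". The input must already be valid UTF-8, otherwise preg_replace() returns null.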
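As a minimal sketch of the same idea, the snippet below strips supplementary-plane characters (such as most emoji) and guards against the null that preg_replace() returns for invalid UTF-8. The function name strip_non_bmp is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper name; not from the original answer.
function strip_non_bmp(string $str): string
{
    // With /u, PCRE treats pattern and subject as UTF-8, so the range
    // \x{10000}-\x{10FFFF} addresses code points, not single bytes.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);

    // preg_replace() returns null when the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// U+1F600 (an emoji) lies outside the BMP and is removed; U+00E9 is kept.
echo strip_non_bmp("caf\u{E9} \u{1F600} ok"), "\n";
```

Note that this only filters by code point range. It does not repair invalid byte sequences such as the lone 0x97 from the question, which is why the helper throws in that case.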
    
  • 2020-11-22 12:38

    If you apply utf8_encode() to a string that is already UTF-8, it returns garbled UTF-8 output.

    I made a function that addresses all these issues. It's called Encoding::toUTF8().

    You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can contain a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

    I wrote it because a service was giving me a data feed that was all messed up, mixing those encodings in the same string.

    Usage:

    require_once('Encoding.php'); 
    use \ForceUTF8\Encoding;  // It's namespaced now.
    
    $utf8_string = Encoding::toUTF8($mixed_string);
    
    $latin1_string = Encoding::toLatin1($mixed_string);
    

    I've included another function, Encoding::fixUTF8(), which fixes UTF-8 strings that look garbled as a result of having been encoded into UTF-8 multiple times.

    Usage:

    require_once('Encoding.php'); 
    use \ForceUTF8\Encoding;  // It's namespaced now.
    
    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);
    

    Examples:

    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    

    will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    

    Download:

    https://github.com/neitanod/forceutf8

  • 2020-11-22 12:38

    Welcome to 2019 and the /u regex modifier, which handles UTF-8 multibyte chars for you.

    If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.

    This method will:

    • Remove all invalid UTF-8 multibyte chars with mb_convert_encoding
    • Remove all non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace

    method:

    function utf8_filter(string $value): string{
        return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
    }
    

    The character class [^[:print:]\n] matches everything that is neither a printable char nor a newline, so preg_replace strips exactly those chars.

    In ASCII, the printable chars range from 32 to 126. Newline (\n) belongs to the control chars, which range from 0 to 31, so it has to be added to the class explicitly: /[^[:print:]\n]/u

    You can try sending strings through the regex that contain chars outside the printable range, such as \x7F (DEL) or \x1B (Esc), and see how they are stripped:

    function utf8_filter(string $value): string{
        return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
    }
    
    $arr = [
        'Danish chars'          => 'Hello from Denmark with æøå',
        'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
    ];
    
    foreach($arr as $k => $v){
        echo "$k:\n---------\n";
        
        $len = strlen($v);
        echo "$v\n(".$len.")\n";
        
        $strip = utf8_decode(utf8_filter(utf8_encode($v)));
        $strip_len = strlen($strip);
        echo $strip."\n(".$strip_len.")\n\n";
        
        echo "Chars removed: ".($len - $strip_len)."\n\n\n";
    }
    

    https://www.tehplayground.com/q5sJ3FOddhv1atpR

  • 2020-11-22 12:39
    $text = iconv("UTF-8", "UTF-8//IGNORE", $text);
    

    This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/
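One caveat worth sketching: on some iconv builds, //IGNORE is reported to raise a notice and return false on illegal input instead of silently dropping it. The wrapper below (iconv_strip_invalid is a hypothetical name) assumes that behaviour is possible and therefore checks the result rather than trusting it.

```php
<?php
// Hypothetical wrapper name; only iconv() and preg_match() are real PHP functions here.
function iconv_strip_invalid(string $text): ?string
{
    // Suppress the notice some builds emit for illegal bytes and
    // inspect the return value instead.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
    if ($clean === false) {
        return null; // this iconv build refused the input
    }

    // An empty pattern with /u matches only if the subject is valid UTF-8.
    return preg_match('//u', $clean) === 1 ? $clean : null;
}

// The bytes from the question: 0x97 0x61 0x6C 0x6F.
var_dump(iconv_strip_invalid("\x97alo"));
```

A null result signals that another approach, for example mb_convert_encoding(), should be tried on that platform.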

  • 2020-11-22 12:39

    The text may contain non-UTF-8 characters. Try this first:

    $nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');
    

    You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php
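A short sketch of what this conversion does with the bytes from the question. By default mbstring replaces each invalid sequence with a substitute character rather than deleting it; mb_substitute_character() controls that. The substitute is set explicitly here because the default can be changed in php.ini.

```php
<?php
$raw = "\x97alo"; // the bytes from the question: 0x97 0x61 0x6C 0x6F

// 0x97 on its own is a stray continuation byte, so this is not valid UTF-8.
$isValid = mb_check_encoding($raw, 'UTF-8');

// Replace invalid bytes with '?' (0x3F, the usual default).
mb_substitute_character(0x3F);
$replaced = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

// Drop invalid bytes instead of replacing them.
mb_substitute_character('none');
$dropped = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');

var_dump($isValid);
echo $replaced, "\n";
echo $dropped, "\n";
```

Setting the substitute character to 'none' is what actually removes the offending bytes, which is closer to what the question asks for.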

  • 2020-11-22 12:39
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
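To illustrate what this one-liner does: it turns accented letters into HTML entities and then keeps only the base letter of each entity, so it transliterates valid UTF-8 rather than removing invalid bytes. The sketch below wraps it in a hypothetical strip_accents() function and adds an html_entity_decode() step, which is an addition not present in the answer, so that unrelated entities such as &amp; are turned back into plain characters.

```php
<?php
// Hypothetical function name; the regex is the one from the answer above.
function strip_accents(string $string): string
{
    // "é" becomes "&eacute;", "æ" becomes "&aelig;", and so on.
    $entities = htmlentities($string, ENT_COMPAT, 'UTF-8');

    // Keep only the base letter(s) of each accent entity: "&eacute;" -> "e".
    $ascii = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $entities
    );

    // Added step (not in the original answer): restore entities the regex
    // did not touch, such as "&amp;".
    return html_entity_decode($ascii, ENT_COMPAT, 'UTF-8');
}

echo strip_accents("F\u{E9}d\u{E9}ration & \u{E6}\u{F8}\u{E5}"), "\n"; // Federation & aeoa
```

Because htmlentities() returns an empty string for invalid UTF-8 unless ENT_IGNORE or ENT_SUBSTITUTE is passed, the input should be cleaned with one of the other approaches first.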
    