Remove non-utf8 characters from string


I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The bytes look like this (hex representation): 0x97 0x61 0x6C 0x6F

18 answers

我寻月下人不归

2020-11-22 12:37
To remove all Unicode characters outside of the Unicode Basic Multilingual Plane (code points above U+FFFF):
```
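// \xFFFF is not a PCRE escape for U+FFFF; use \x{...} together with the /u modifier.
// Code points above U+FFFF lie outside the Basic Multilingual Plane.
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);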
```
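A minimal sketch of the same idea wrapped in a function, assuming the input is already valid UTF-8. The name `strip_non_bmp` is hypothetical. It relies on the `\x{...}` escape with the `/u` modifier and returns null for input that is not valid UTF-8, because `preg_replace()` fails in that case:

```php
<?php
// Hypothetical helper: removes every code point above U+FFFF (for example emoji).
// Returns null if $str is not valid UTF-8, because preg_replace() fails under /u.
function strip_non_bmp(string $str): ?string
{
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
}

// U+00E9 is inside the BMP and is kept; U+1F600 is outside and is removed.
echo strip_non_bmp("caf\u{00E9} \u{1F600}!"), "\n";
```

Note that this does not help with the bytes from the question: 0x97 on its own is an invalid sequence, not a character outside the BMP.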
悲&欢浪女

2020-11-22 12:38
If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.

I made a function that addresses all of these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.

Usage:
```
require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);
```
I've included another function, Encoding::fixUTF8(), which fixes UTF-8 strings that look garbled as a result of having been encoded into UTF-8 multiple times.

Usage:
```
require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
```
Examples:
```
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
```
will output:
```
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
```
Download:

https://github.com/neitanod/forceutf8
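
For comparison, here is a minimal sketch that does not use the library, under a much stronger assumption: the whole string is either valid UTF-8 or entirely Windows-1252, never a mix. The helper name `to_utf8_or_cp1252` is hypothetical. With the bytes from the question, 0x97 is read as the Windows-1252 em dash and converted rather than removed:

```php
<?php
// Hypothetical helper, not part of the ForceUTF8 library.
// Assumes $s is either valid UTF-8 or entirely Windows-1252 (no mixed encodings).
function to_utf8_or_cp1252(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// Bytes from the question; print the result as hex to make the conversion visible.
echo bin2hex(to_utf8_or_cp1252("\x97\x61\x6C\x6F")), "\n";
```

Strings that really do mix encodings need per-sequence handling, which is what the library is for.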
遇见更好的自我

2020-11-22 12:38
Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.

If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.

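This method will:
- Replace all invalid UTF-8 byte sequences with mb_convert_encoding (by default they become "?"; calling mb_substitute_character('none') beforehand drops them instead)
- Remove all non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace
Method: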
```
function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}
```
[:print:] matches all printable chars and \n matches newlines; the negated class [^[:print:]\n] therefore strips everything else.

In ASCII, the printable chars range from 32 to 126, while newline \n is one of the control chars, which range from 0 to 31, so we have to add newline to the regex explicitly: /[^[:print:]\n]/u

You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped:
```
function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

$arr = [
    'Danish chars'          => 'Hello from Denmark with æøå',
    'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
];

foreach($arr as $k => $v){
    echo "$k:\n---------\n";
    
    $len = strlen($v);
    echo "$v\n(".$len.")\n";
    
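    // The literals above are assumed to be UTF-8 already, so no utf8_encode()/utf8_decode()
    // round trip is needed (both functions are deprecated as of PHP 8.2).
    $strip = utf8_filter($v);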
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";
    
    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}
```
https://www.tehplayground.com/q5sJ3FOddhv1atpR
谎友^

2020-11-22 12:39
```
$text = iconv("UTF-8", "UTF-8//IGNORE", $text);
```
This is what I am using. Seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/
花落未央

2020-11-22 12:39
The text may contain non-UTF-8 characters. Try this first:
```
$nonutf8 = mb_convert_encoding($nonutf8, 'UTF-8', 'UTF-8');
```
You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php
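
By default, mbstring substitutes "?" for each invalid byte sequence rather than removing it. A minimal sketch of how the same call could drop them instead, assuming the mbstring extension is loaded; the helper name `drop_invalid_utf8` is hypothetical:

```php
<?php
// Hypothetical helper: removes invalid UTF-8 bytes instead of replacing them with "?".
function drop_invalid_utf8(string $s): string
{
    $previous = mb_substitute_character();  // remember the current global setting
    mb_substitute_character('none');        // 'none' means: emit nothing for invalid input
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting for other callers
    return $clean;
}

// Bytes from the question: the stray 0x97 is dropped, the ASCII letters remain.
echo drop_invalid_utf8("\x97\x61\x6C\x6F"), "\n";
```

The setting is global, which is why the sketch saves and restores it around the conversion.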

失恋的感觉

2020-11-22 12:39

```
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
```
