Remove non-utf8 characters from string


I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).


18 Answers

别那么骄傲

2020-11-22 12:44
From a recent patch to Drupal's Feeds JSON parser module:
```
//remove everything except valid letters (from any language)
$raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);
```
In case you're wondering: yes, it keeps spaces as valid characters.

It did what I needed. It removes the now-widespread emoji characters that don't fit into MySQL's 'utf8' character set (which stores at most three bytes per character) and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details see https://www.drupal.org/node/1824506#comment-6881382

耶瑟儿～

2020-11-22 12:48

Slightly different from the question, but what I do is use HtmlEncode(string).

Pseudocode:

var encoded = HtmlEncode(input);
// Drop the numeric character references produced by the encoder.
encoded = Regex.Replace(encoded, @"&#\d+?;", "");
var result = HtmlDecode(encoded);

Input and output:

"Headlight\x007E Bracket, &#123; Cafe Racer<> Style,Â Stainless Steel 中文呢？"
"Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢？"

I know it's not perfect, but it does the job for me.


隐瞒了意图╮

2020-11-22 12:50

static $preg = <<<'END'
%(
[\x09\x0A\x0D\x20-\x7E]              # ASCII
| [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)%xs
END;
if (preg_match_all($preg, $string, $match)) {
    $string = implode('', $match[0]);
} else {
    $string = '';
}

It works in our service.
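
One way to sanity-check the result of a byte-level filter like this is to ask PCRE itself whether a string is well-formed UTF-8. The sketch below relies on `preg_match()` returning `false` when the `u` modifier meets malformed input; the helper name `is_valid_utf8` is made up for illustration and is not part of the answer above.

```php
<?php
// Hypothetical helper: reports whether $s is well-formed UTF-8.
// With the "u" modifier, preg_match() returns false on malformed input
// instead of 0 or 1.
function is_valid_utf8(string $s): bool
{
    // The empty pattern matches at offset 0 of any valid string.
    return preg_match('//u', $s) === 1;
}

// The bytes from the question: 0x97 is a continuation byte with no lead byte.
var_dump(is_valid_utf8("\x97\x61\x6C\x6F")); // bool(false)
var_dump(is_valid_utf8("alo"));              // bool(true)
```

Running the filtered string through such a check before storing it would catch a filter that lets a bad sequence through.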


北海茫月

2020-11-22 12:56
Try this:
```
$string = iconv("UTF-8","UTF-8//IGNORE",$string);
```
According to the iconv manual, the function takes the input charset as its first parameter, the output charset as its second, and the input string as its third.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function drops (strips) all characters in the input string that can't be represented in the output charset. In effect, this filters the input string.
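
How `//IGNORE` behaves depends on the system's iconv implementation; on some builds `iconv()` raises a notice and returns `false` for malformed input even with `//IGNORE`. A minimal defensive sketch, assuming you would rather get `null` than a silently empty string in that case (the function name `strip_invalid_with_iconv` is hypothetical):

```php
<?php
// Hypothetical wrapper around the one-liner above.
// Returns the filtered string, or null when this system's iconv rejects
// the input instead of skipping the bad bytes.
function strip_invalid_with_iconv(string $s): ?string
{
    // @ suppresses the notice some iconv builds emit on illegal bytes.
    $out = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    return $out === false ? null : $out;
}

// "alo" where //IGNORE is honoured, NULL otherwise.
var_dump(strip_invalid_with_iconv("\x97\x61\x6C\x6F"));
```

Checking for `null` lets the caller fall back to another method rather than store an empty value.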
栀梦

2020-11-22 12:56
UConverter can be used since PHP 5.5. It is the better choice if you use the intl extension and don't use mbstring.
```
function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
```
htmlspecialchars can be used to replace invalid byte sequences since PHP 5.4: with ENT_SUBSTITUTE, each invalid sequence becomes U+FFFD. It is better than preg_match both for handling large inputs and for accuracy; many incorrect regular-expression implementations are in circulation.
```
function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
```
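Note that `ENT_SUBSTITUTE` replaces each invalid sequence with U+FFFD rather than deleting it. If the goal is removal, as in the question, one possible follow-up is to strip the replacement character afterwards. This is only a sketch: the function name `strip_invalid_byte_sequence` is made up, and it also deletes any U+FFFD that was already present in the input.

```php
<?php
// Hypothetical variant of replace_invalid_byte_sequence3() that removes
// invalid bytes instead of leaving U+FFFD in their place.
function strip_invalid_byte_sequence(string $str): string
{
    $flags = ENT_QUOTES | ENT_SUBSTITUTE;
    // Round trip through htmlspecialchars: invalid sequences become U+FFFD.
    $substituted = htmlspecialchars_decode(
        htmlspecialchars($str, $flags, 'UTF-8'),
        $flags
    );
    // "\xEF\xBF\xBD" is the UTF-8 encoding of U+FFFD.
    return str_replace("\xEF\xBF\xBD", '', $substituted);
}

echo strip_invalid_byte_sequence("\x97\x61\x6C\x6F"), "\n"; // alo
```

Passing the same flags to both calls keeps the encode/decode round trip symmetric for quotes.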
囚心锁ツ

2020-11-22 12:57
Maybe not the most precise solution, but it gets the job done with a single line of code:
```
echo str_replace("?","",(utf8_decode($str)));
```
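utf8_decode converts the string from UTF-8 to ISO-8859-1 and turns invalid bytes, as well as any character that ISO-8859-1 cannot represent, into a question mark;
str_replace then strips every question mark, including any that were legitimately in the text. Note that utf8_decode is deprecated as of PHP 8.2.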
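Because this one-liner also discards every character outside ISO-8859-1, a variant that keeps the text as UTF-8 may be preferable. A minimal sketch, assuming the mbstring extension is loaded (this is a different technique from the answer above, and `strip_invalid_utf8` is a hypothetical name):

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function strip_invalid_utf8(string $s): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid bytes, no "?"
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $clean;
}

echo strip_invalid_utf8("\x97\x61\x6C\x6F"), "\n"; // alo
```

Unlike the question-mark approach, this leaves genuine question marks and non-Latin characters in place.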