Im having a problem with removing non-utf8 characters from string, which are not displaying properly. Characters are like this 0x97 0x61 0x6C 0x6F (hex representation)
From recent patch to Drupal's Feeds JSON parser module:
//remove everything except valid letters (from any language)
$raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);
If you're concerned yes it retains spaces as valid characters.
Did what I needed. It removes widespread nowadays emoji-characters that don't fit into MySQL's 'utf8' character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".
For details see https://www.drupal.org/node/1824506#comment-6881382
Slightly different to the question, but what I am doing is to use HtmlEncode(string),
pseudo code here
var encoded = HtmlEncode(string);
encoded = Regex.Replace(encoded, "&#\d+?;", "");
var result = HtmlDecode(encoded);
input and output
"Headlight\x007E Bracket, { Cafe Racer<> Style, Stainless Steel 中文呢?"
"Headlight~ Bracket, { Cafe Racer<> Style, Stainless Steel 中文呢?"
I know it's not perfect, but does the job for me.
static $preg = <<<'END'
%(
[\x09\x0A\x0D\x20-\x7E]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)%xs
END;
if (preg_match_all($preg, $string, $match)) {
$string = implode('', $match[0]);
} else {
$string = '';
}
it work on our service
try this:
$string = iconv("UTF-8","UTF-8//IGNORE",$string);
According to the iconv manual, the function will take the first parameter as the input charset, second parameter as the output charset, and the third as the actual input string.
If you set both the input and output charset to UTF-8, and append the //IGNORE
flag to the output charset, the function will drop(strip) all characters in the input string that can't be represented by the output charset. Thus, filtering the input string in effect.
UConverter can be used since PHP 5.5. UConverter is better the choice if you use intl extension and don't use mbstring.
function replace_invalid_byte_sequence($str)
{
return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence2($str)
{
return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
htmlspecialchars can be used to remove invalid byte sequence since PHP 5.4. Htmlspecialchars is better than preg_match for handling large size of byte and the accuracy. A lot of the wrong implementation by using regular expression can be seen.
function replace_invalid_byte_sequence3($str)
{
return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
Maybe not the most precise solution, but it gets the job done with a single line of code:
echo str_replace("?","",(utf8_decode($str)));
utf8_decode
will convert the characters to a question mark;
str_replace
will strip out the question marks.