I am using utf8-charset tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding. When I insert 4-byte encoded UTF-8 characters like "𡃁","𨋢","𠵱","𥄫","𠽌","唧","𠱁", MySQL either raises an error or drops the text that follows the character.
How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?
The following regular expression will replace 4-byte UTF-8 characters:
function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);
}

var_dump(replace4byte('d'), replace4byte('d𡃁d'));
This doesn't rely on the /u modifier, so it works even if PCRE was compiled without UTF-8 support. However, if you have that support, deceze's preg_replace_callback is neater.
(Regex adapted from Ensuring valid utf-8 in PHP)
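To see what the three alternatives in that pattern do at the edges, here is a small sketch that runs replace4byte() over the last 3-byte code point and the first and last 4-byte ones. The function is repeated only so the snippet runs on its own. The sample bytes and the choice of U+FFFD as the replacement are my own assumptions, not part of the original answer.

```php
<?php
// Same pattern as replace4byte() above, repeated so this sketch is self-contained.
function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);
}

// Illustrative boundary cases, written as raw bytes so the file encoding does not matter.
$samples = array(
    "\xEF\xBF\xBF",     // U+FFFF, last 3-byte code point: must be kept
    "\xF0\x90\x80\x80", // U+10000, first 4-byte code point: must be replaced
    "\xF4\x8F\xBF\xBF", // U+10FFFF, last valid code point: must be replaced
);

foreach ($samples as $char) {
    // Replace with U+FFFD (bytes EF BF BD), which fits in MySQL's 3-byte utf8.
    echo bin2hex(replace4byte("a{$char}b", "\xEF\xBF\xBD")), "\n";
}
// Output:
// 61efbfbf62
// 61efbfbd62
// 61efbfbd62
```

Using a visible replacement character rather than an empty string makes it easier to spot afterwards where something was removed, but either works.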
This should work:
if (max(array_map('ord', str_split($string))) >= 240)
The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the lead byte has a value of 240 (0xF0) or higher. If the string contains any such byte, that indicates a 4-byte sequence.
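The bit patterns can be checked directly. The sketch below prints the lead byte of a 3-byte and a 4-byte character in binary, and wraps the same "is there a byte >= 0xF0" idea in a byte-level regex. Both function names are hypothetical, and the check assumes the input is valid UTF-8, where 0xF0-0xF4 can only appear as the lead byte of a 4-byte sequence.

```php
<?php
// Hypothetical helper: the first byte of a UTF-8 character as 8 binary digits.
function leadByteBits($char) {
    return sprintf('%08b', ord($char[0]));
}

// Hypothetical helper: byte-level scan, no /u modifier and no intermediate array.
function hasFourByteLead($string) {
    // Without /u, PCRE matches single bytes, so this looks for a byte in 0xF0-0xF4.
    return preg_match('/[\xF0-\xF4]/', $string) === 1;
}

echo leadByteBits("\xEF\xBF\xBF"), "\n";     // U+FFFF:  11101111 (1110xxxx, 3 bytes)
echo leadByteBits("\xF0\x90\x80\x80"), "\n"; // U+10000: 11110000 (11110xxx, 4 bytes)

var_dump(hasFourByteLead("abc\xEF\xBF\xBF"));     // bool(false)
var_dump(hasFourByteLead("abc\xF0\x90\x80\x80")); // bool(true)
var_dump(hasFourByteLead(''));                    // bool(false)
```

The regex variant also copes with an empty string. As far as I know, the max()/str_split() one-liner does not on PHP 8.2+, where str_split('') returns an empty array and max() refuses an empty array.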
If you want to remove long characters, this will do:
preg_replace_callback('/./u', function (array $match) {
    // With /u, "." matches one whole character; drop it if it is 4 bytes long
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);
Though there may be a more elegant regex way to express high codepoints directly.
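One way to express the high code points directly, as a sketch: with the /u modifier a character class can name the range U+10000 to U+10FFFF, which is exactly the set of code points that need four bytes in UTF-8. The function name is hypothetical. This assumes PCRE has UTF-8 support and that the input is valid UTF-8, since preg_replace returns null on malformed input when /u is set.

```php
<?php
// Hypothetical helper: remove (or replace) every code point above U+FFFF.
function stripSupplementary($string, $replacement = '') {
    // \x{...} is handed to PCRE as-is because the pattern is single-quoted.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}

var_dump(stripSupplementary("d\xF0\x90\x80\x80d"));      // string(2) "dd"
var_dump(stripSupplementary("d\xF0\x90\x80\x80d", '?')); // string(3) "d?d"
var_dump(stripSupplementary("d\xEF\xBF\xBFd") === "d\xEF\xBF\xBFd"); // bool(true), U+FFFF is kept
```

Unlike the callback version, this does not invoke a PHP function for every character.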
Source: https://stackoverflow.com/questions/16496554/can-php-detect-4-byte-encoded-utf8-chars