Question
anubhava's answer about matching ranges of Unicode characters led me to the regex for stripping a specific range of single-code-point characters. With it, I can now match all the miscellaneous symbols in this list (which includes emoticons) with this simple expression:
preg_replace('/[\x{2600}-\x{26FF}]/u', '', $str);
However, I also want to match the emoji in this list, which are encoded as surrogate pairs, but as nhahtdh explained in a comment:
There is a range from d800 to dfff to specify surrogates in UTF-16 to allow for more characters to be specified. A single surrogate is not a valid character in UTF-16 (a pair is necessary to specify a valid character).
So, for example, when I try this:
preg_replace('/\x{D83D}\x{DE00}/u', '', $str);
to replace only the first of the surrogate-pair emoji on this list, i.e. 😀, PHP throws this:
preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
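The error makes sense once the surrogate pair is read as one code point: D83D DE00 is only the UTF-16 encoding of U+1F600, and in /u mode PCRE expects code points, not UTF-16 code units. The following is a minimal sketch of that arithmetic; the helper name surrogate_pair_to_code_point is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): combine a UTF-16
// surrogate pair into the single code point it encodes.
function surrogate_pair_to_code_point(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$codePoint = surrogate_pair_to_code_point(0xD83D, 0xDE00);
printf("U+%X\n", $codePoint); // U+1F600

// With the /u modifier, PCRE accepts the code point itself in \x{...}.
$pattern = sprintf('/\x{%X}/u', $codePoint); // '/\x{1F600}/u'
echo preg_replace($pattern, '', "a\u{1F600}b"), "\n"; // ab
```

The double-quoted "\u{1F600}" in the test subject assumes PHP 7.0 or later; the single-quoted pattern itself does not depend on that feature.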
I have tried several different combinations, including what I believed to be the UTF-8 representation of the code points above for 😀 ('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u'), but I was still unable to match it. I also looked into the other PCRE pattern modifiers, but u seems to be the only one that enables UTF-8 handling.
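A likely reason the UTF-8 attempt fails: with the u modifier, \x{00F0} denotes the code point U+00F0 rather than the byte 0xF0, so the character class lists four unrelated characters. The sketch below illustrates this under the assumption of PHP 7.0+ (for the "\u{...}" escape used to build the test subject); variable names are illustrative only.

```php
<?php
$emoji = "\u{1F600}"; // 😀 (the \u{...} escape needs PHP 7.0+)

// The UTF-8 encoding of U+1F600 is the four bytes F0 9F 98 80.
$hex = bin2hex($emoji);

// With /u, \x{00F0} means the code point U+00F0, not the byte 0xF0,
// so this character class never matches the emoji.
$classHits = preg_match('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u', $emoji);

// Without /u, PCRE works on raw bytes, so the byte sequence does match.
$stripped = preg_replace('/\xF0\x9F\x98\x80/', '', "a{$emoji}b");

var_dump($hex, $classHits, $stripped); // "f09f9880", 0, "ab"
```

Byte-mode matching handles one fixed emoji, but a range such as 😀-🙏 cannot be expressed as a simple byte character class, which is why the u modifier is still the practical route.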
Am I missing any "escape" alternative here?
Answer 1:
revo's comment above was very helpful to find a solution:
If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)
Since I am using PHP 7, echo "\u{1F602}"; outputs 😂, as described on this PHP RFC page on the Unicode escape. In essence, the proposal was:
A new escape sequence is added for double-quoted strings and heredocs.
\u{ codepoint-digits } where codepoint-digits is composed of hexadecimal digits.
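The difference between the two quoting styles can be checked directly. This is a small sketch assuming PHP 7.0 or later; the variable names are illustrative.

```php
<?php
$double = "\u{1F602}"; // expanded by PHP to the UTF-8 bytes of U+1F602
$single = '\u{1F602}'; // stays the literal nine-character text \u{1F602}

var_dump(strlen($double));  // int(4)
var_dump(strlen($single));  // int(9)
var_dump(bin2hex($double)); // string(8) "f09f9882"
```

The expansion happens in the PHP string parser before the pattern reaches the regex engine, which is why the quoting style matters here.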
This means the pattern passed to preg_replace (normally single-quoted, to avoid the variable expansion of double-quoted strings) now has to be double-quoted and needs some preg_quote magic. This is the solution I came up with:
preg_replace(
    // Single code points: Miscellaneous Symbols block
    // http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
    "/[\x{2600}-\x{26FF}" .
    // Concatenated with the Emoticons block (surrogate pairs in UTF-16)
    // https://www.fileformat.info/info/unicode/block/emoticons/list.htm
    preg_quote("\u{1F600}", '/') . "-" . preg_quote("\u{1F64F}", '/') .
    "]/u",
    '',
    $str
);
Here's the proof of the above in 3v4l.
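As a side note, preg_quote only escapes regex metacharacters, and the UTF-8 bytes of these two emoji contain none, so the call appears to leave them unchanged. The sketch below compares the pattern built with and without it; it assumes PHP 7.0+ and uses illustrative variable names.

```php
<?php
// Pattern built as in the solution above, with preg_quote().
$withQuote = "/[\x{2600}-\x{26FF}"
    . preg_quote("\u{1F600}", '/') . "-" . preg_quote("\u{1F64F}", '/')
    . "]/u";

// Same pattern with the emoji escapes written inline.
$without = "/[\x{2600}-\x{26FF}\u{1F600}-\u{1F64F}]/u";

var_dump($withQuote === $without); // bool(true)

// Sample input: U+2600 (sun) and U+1F600 (grinning face) are both removed.
$clean = preg_replace($withQuote, '', "sun \u{2600} grin \u{1F600} done");
var_dump($clean); // string(15) "sun  grin  done"
```

Note that "\x{2600}" survives double quotes untouched because PHP's own \x escape only accepts one or two hex digits directly after the x, so the sequence is passed through to PCRE as written.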
EDIT: a simpler solution
In another comment, revo pointed out that placing the Unicode characters directly into the regex character class works with single-quoted strings and with earlier PHP versions (e.g. 4.3.4):
preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);
To use PHP 7's new escape syntax, though, you still need double quotes:
preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);
Here's revo's proof in 3v4l.
Source: https://stackoverflow.com/questions/51947319/php-how-to-match-a-range-of-unicode-paired-surrogates-emoticons-emoji