From the manual:
After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...
The syntax is a way to specify a character by value:
\xAB specifies a code point in the range 0-FF.
\x{ABCD} specifies a code point in the range 0-FFFF.
The indicated wording from the manual is a bit confusing, perhaps in an attempt to be precise. Code points 128-255 (and others, up to 0x7FF) are encoded as two bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII but will not match input in other encodings or code pages (e.g. CP437) that use byte values in that range. The manual is, in a roundabout way, saying that a Unicode regular expression is only suitable for correctly encoded input. However:
It doesn't mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) and then CD (two characters) [1]. The braces address this parsing ambiguity issue:
After \x, up to two hexadecimal digits are read .. In UTF-8 mode, \x{...} is allowed ..
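The difference between the two forms can be checked with a short sketch. The euro sign (U+20AC) and the variable names are illustrative choices rather than anything from the manual, and the "\u{...}" string syntax assumes PHP 7.0 or later. The patterns are single-quoted so that PHP passes the escapes to PCRE unchanged.

```php
<?php
// Subject: U+20AC EURO SIGN, written with PHP 7.0+ code point syntax.
$euro = "\u{20AC}";

// Braced form: the whole hex string names one code point.
$braced = preg_match('/^\x{20AC}$/u', $euro);        // 1

// Unbraced form: \x20 (a space) followed by the literal text "AC".
$unbracedEuro = preg_match('/^\x20AC$/u', $euro);    // 0
$unbracedText = preg_match('/^\x20AC$/u', ' AC');    // 1

var_dump($braced, $unbracedEuro, $unbracedText);
```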
Some other languages use \u instead of \x for the longer form.
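[1] Consider that this matches, where the subject is U+00C3 (UTF-8 bytes C3 83) followed by the literal text A4; with the u modifier the subject itself must be valid UTF-8, so a bare \xC3 byte before A4 would make preg_match return false:
preg_match('/\xC3A4/u', "\xC3\x83" . "A4");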
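The earlier point about correctly encoded input can be illustrated the same way. This is a minimal sketch, assuming ISO-8859-1 as the example single-byte encoding (CP437 behaves analogously, with a different byte for the same letter). The variable names are made up for the example.

```php
<?php
$pattern = '/\xE4/u';      // U+00E4 in UTF-8 mode, passed to PCRE as written

$utf8   = "\xC3\xA4";      // U+00E4 encoded as UTF-8: two bytes
$latin1 = "\xE4";          // the same letter as one ISO-8859-1 byte; not valid UTF-8

$utf8Result   = preg_match($pattern, $utf8);     // 1
$latin1Result = preg_match($pattern, $latin1);   // false: the subject is rejected
$latin1Error  = preg_last_error();               // PREG_BAD_UTF8_ERROR

// Without the u modifier both pattern and subject are plain bytes.
$byteResult = preg_match('/\xE4/', $latin1);     // 1

var_dump(strlen($utf8), $utf8Result, $latin1Result, $byteResult);
```

Note that preg_last_error() reflects only the most recent call, so it is read before the next preg_match.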