Without the u flag, the hex range you can use is [\x{00}-\x{ff}], but with the u flag it goes up to a 4-byte value, \x{7FFFFFFF}.
So I can't match a letter like the one in the question, \x{210C1}?
I'm not sure about PHP, but there really is no governor on code points,
so it doesn't matter that there are only some 1.1 million valid ones.
That is subject to change at any time, but it's not really up to engines
to enforce it. There are reserved code points that are holes in the valid range,
and there are surrogates in the valid range; there are endless reasons for there
to be no restriction other than the word size.
For UTF-32, you can't go over 31 bits because bit 32 is the sign bit, giving a range of
0x00000000 - 0x7FFFFFFF.
That makes sense when code points are held in a signed int,
the natural size of a 32-bit hardware register.
For UTF-16, the same limitation shows up masked down to 16 bits,
leaving 0x0000 - 0xFFFF
as the valid range.
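To make that distinction concrete, here is a sketch of my own (the helper name and the checks are hypothetical, not something any engine is required to do) that layers the Unicode rules on top of the raw word-size range:

/* Word-size ceiling: 31 bits if code points live in a signed 32-bit int. */
#define WORDSIZE_MAX 0x7FFFFFFF

/* Unicode scalar values: at most 0x10FFFF, excluding the surrogate hole. */
static int is_unicode_scalar(unsigned int cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

An engine that only checks cp <= WORDSIZE_MAX will happily accept \x{7FFFFFFF}; enforcing is_unicode_scalar() is a policy choice, not a requirement.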
Usually, if you use an engine that supports ICU, you should be able to use it;
ICU converts both the subject and the regex into UTF-32. Boost Regex is one such engine.
edit:
Regarding UTF-16:
I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs, but that left only 20 total payload bits between the pair as usable:
10 bits in each surrogate, with the other 6 used to mark it as hi or lo.
Those 20 bits (0x100000 values), offset to start above 0xFFFF, left the Unicode folks with a ceiling of 0x10FFFF code points, with unusable holes.
To be able to convert between the different encodings (8/16/32), all the code points
must actually be convertible. Thus the forever-backward-compatible 20-bit scheme is
the trap they ran into early, but now must live with.
Regardless, regex engines won't be enforcing this limit anytime soon, probably never.
As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That just pertains to a literal encoded character during conversion, not a hex representation of one. For instance, it's easy to search a text in UTF-16 mode (only) for unpaired surrogates, or even paired ones.
But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say:
'Hey, wait, the mode is UTF-16, I'd better convert \x{210C1}
to \x{D844}\x{DCC1}.
Wait, if I did that, what do I do if it's quantified, \x{210C1}+
- start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]?
Nah, better to limit it to \x{FFFF}.'
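For the record, that \x{D844}\x{DCC1} pair falls straight out of the surrogate math described above:

0x210C1 - 0x10000 = 0x110C1
hi = 0xD800 + (0x110C1 >> 10)   = 0xD800 + 0x44 = 0xD844
lo = 0xDC00 + (0x110C1 & 0x3FF) = 0xDC00 + 0xC1 = 0xDCC1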
Some handy-dandy pseudo-code surrogate conversions I use:
Definitions:
====================
10-bit payload mask
3FF = 000000 1111111111
Hi Surrogate
D800 = 110110 0000000000
DBFF = 110110 1111111111
Lo Surrogate
DC00 = 110111 0000000000
DFFF = 110111 1111111111
Conversions:
====================
UTF-16 Surrogates to UTF-32

if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
{
    /* take 10 payload bits from each half, then add back the 0x10000 offset */
    u32Out = 0x10000 + ( ((hi & 0x3FF) << 10) | (lo & 0x3FF) );
}

UTF-32 to UTF-16 Surrogates

if ( u32In >= 0x10000 )
{
    u32In -= 0x10000;                            /* now a 20-bit value  */
    hi = (0xD800 + ((u32In & 0xFFC00) >> 10));   /* top 10 bits         */
    lo = (0xDC00 + (u32In & 0x3FF));             /* bottom 10 bits      */
}
Macros:
====================
#define TESTFOR_SURROGATE_HI(hs)      ( ((hs) & 0xFC00) == 0xD800 )
#define TESTFOR_SURROGATE_LO(ls)      ( ((ls) & 0xFC00) == 0xDC00 )
#define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )
//
#define PTR_TESTFOR_SURROGATE_HI(ptr)   ( (*(ptr) & 0xFC00) == 0xD800 )
#define PTR_TESTFOR_SURROGATE_LO(ptr)   ( (*(ptr) & 0xFC00) == 0xDC00 )
#define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( ((*(ptr) & 0xFC00) == 0xD800) && ((*((ptr)+1) & 0xFC00) == 0xDC00) )
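To show the conversions and macros in action, here's a minimal, self-contained round trip of \x{210C1} (the pair macro is repeated so the snippet compiles on its own):

#include <stdio.h>

#define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )

int main(void)
{
    /* UTF-32 to UTF-16: split U+210C1 into a surrogate pair */
    unsigned int u32In = 0x210C1, hi = 0, lo = 0;
    if ( u32In >= 0x10000 )
    {
        unsigned int v = u32In - 0x10000;        /* 0x110C1 */
        hi = 0xD800 + ((v & 0xFFC00) >> 10);     /* 0xD844  */
        lo = 0xDC00 + (v & 0x3FF);               /* 0xDCC1  */
    }
    printf("hi = %04X, lo = %04X\n", hi, lo);

    /* UTF-16 to UTF-32: recombine the pair */
    if ( TESTFOR_SURROGATE_PAIR(hi, lo) )
    {
        unsigned int u32Out = 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF));
        printf("u32Out = %X\n", u32Out);         /* 0x210C1 */
    }
    return 0;
}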
As minitech suggests in the first comment, you have to use the code point - for this character, it's \x{210C1}. That's also the encoded form in UTF-32. F0 A1 83 81 is the UTF-8 encoded sequence (see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1).
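If you want to verify those bytes yourself, here is a small sketch (utf8_encode is my own hypothetical helper, not a library call) that prints the UTF-8 sequence for any code point up to 0x10FFFF:

#include <stdio.h>

/* Encode one code point (<= 0x10FFFF, non-surrogate) into UTF-8;
   returns the number of bytes written. */
static int utf8_encode(unsigned int cp, unsigned char out[4])
{
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = (unsigned char)(0xC0 | (cp >> 6));
                        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (unsigned char)(0xE0 | (cp >> 12));
                        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (unsigned char)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x210C1, buf);
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);                 /* F0 A1 83 81 */
    printf("\n");
    return 0;
}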
There are some versions of PCRE where you can use values up to \x{7FFFFFFF}. But I really don't know what could be matched with it.
To quote http://www.pcre.org/pcre.txt:
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are used in pairs to encode values greater than 0xffff.
[...]
In UTF-32 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are ill-formed in UTF-32.
0x10ffff is the largest value you can use to match a character (that's what I take from this). 0x10ffff is currently also the largest code point defined in the Unicode standard (see What are some of the differences between the UTFs?) - thus any value above it doesn't make sense (or I just don't get it)...
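For what it's worth, here's a minimal sketch of matching that code point from C with the classic PCRE (PCRE1) API - assuming a libpcre built with UTF-8 support:

#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void)
{
    const char *err;
    int erroff, ovec[3];

    /* \x{210C1} needs PCRE_UTF8; without it, \x{} is limited to 0xFF. */
    pcre *re = pcre_compile("^\\x{210C1}$", PCRE_UTF8, &err, &erroff, NULL);
    if (!re)
    {
        fprintf(stderr, "compile failed at %d: %s\n", erroff, err);
        return 1;
    }

    const char *subject = "\xF0\xA1\x83\x81";    /* U+210C1 as UTF-8 */
    int rc = pcre_exec(re, NULL, subject, (int)strlen(subject), 0, 0, ovec, 3);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");

    pcre_free(re);
    return 0;
}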
"but want to know about the max hex boundary in a regex": * in all utf modes: 0x10ffff * native 8-bt mode: 0xff * native 16-bit mode: 0xffff * native 32-bit mode: 0x1fffffff
Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how the Unicode code points are stored.
In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.
I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Using the example in the question,