Maximum Hex value in regex

南旧 2021-02-12 13:27

Without the u flag, the hex range that can be used is [\x{00}-\x{ff}], but with the u flag it goes up to a 4-byte value, \x{7fffffff}.

5 Answers
  • 2021-02-12 13:48

    So I can't match a letter like

  • 2021-02-12 13:48

    I'm not sure about PHP, but there really is no governor on code points,
    so it doesn't matter that there are only some 1.1 million valid ones.
    That is subject to change at any time, but it's not really up to engines
    to enforce that. There are reserved code points that are holes in the valid range,
    and there are surrogates in the valid range; the reasons are endless for there
    to be no restriction other than the word size.

    For UTF-32, you can't go over 31 bits because 32 is the sign bit.
    0x00000000 - 0x7FFFFFFF

    Makes sense since unsigned int as a data type is the natural size of 32-bit hardware registers.

    For UTF-16 this is even more true: you can see the same limitation masked to 16 bits. Bit 32 is still the sign bit, leaving 0x0000 - 0xFFFF as the valid range.

    Usually, if you use an engine that supports ICU you should be able to use it,
    which converts both source and regex into UTF-32. Boost Regex is one such engine.

    edit:

    Regarding UTF-16

    I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs. But that left only 20 bits in total between the pair as usable.

    There are 10 bits in each surrogate, with the other 6 used to determine hi or lo.
    It looks like this left the Unicode folks with a limit of 20 bits plus the original 0xFFFF, for a maximum code point of 0x10FFFF, with unusable holes.
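
    The arithmetic behind that limit can be checked with a short sketch. The variable names are made up for illustration; the only inputs are the bit counts and the surrogate range quoted in this answer.

```python
# Code points reachable through UTF-16, following the bit counts above.
BMP_SIZE = 0x10000                  # the original 16-bit range 0x0000-0xFFFF
PAYLOAD_BITS = 10 + 10              # 10 usable bits from each surrogate
SUPPLEMENTARY = 1 << PAYLOAD_BITS   # 0x100000 values addressed by pairs

max_code_point = BMP_SIZE + SUPPLEMENTARY - 1
surrogate_hole = 0xDFFF - 0xD800 + 1    # values reserved for the pairs themselves

print(hex(max_code_point))                    # 0x10ffff
print(max_code_point + 1 - surrogate_hole)    # 1112064
```

    The second number is the "some 1.1 million" figure mentioned earlier: every value up to 0x10FFFF minus the surrogate hole.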

    To be able to convert to a different encoding (8/16/32), all the code points
    must actually be convertible. Thus the forever backward-compatible 20 bits are
    the trap they ran into early, but now must live with.

    Regardless, regex engines won't be enforcing this limit anytime soon, probably never.
    As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That only pertains to a literal encoded character during conversion, not a hex representation of one. For instance, it's easy to search a text in UTF-16 (only) mode for unpaired surrogates, or even paired ones.

    But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say:
    'Hey wait, the mode is UTF-16, I'd better convert \x{210C1} to \x{D844}\x{DCC1}. Wait, if I did that, what do I do if it's quantified, as in \x{210C1}+? Start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]? Nah... better limit it to \x{FFFF}.'

    Some handy dandy, pseudo-code surrogate conversions I use:

     Definitions:
     ====================
     10-bits
      3FF = 000000  1111111111
    
     Hi Surrogate
     D800 = 110110  0000000000
     DBFF = 110110  1111111111 
    
     Lo Surrogate
     DC00 = 110111  0000000000
     DFFF = 110111  1111111111
    
    
     Conversions:
     ====================
     UTF-16 Surrogates to UTF-32
     if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
     {
        u32Out = 0x10000 + (  ((hi & 0x3FF) << 10) | (lo & 0x3FF)  );
     }
    
     UTF-32 to UTF-16 Surrogates
     if ( u32In >= 0x10000)
     {
        u32In -= 0x10000;
        hi = (0xD800 + ((u32In & 0xFFC00) >> 10));
        lo = (0xDC00 + (u32In & 0x3FF));
     }
    
     Macros:
     ====================
     #define TESTFOR_SURROGATE_HI(hs) ((((hs) & 0xFC00)) == 0xD800 )
     #define TESTFOR_SURROGATE_LO(ls) ((((ls) & 0xFC00)) == 0xDC00 )
     #define TESTFOR_SURROGATE_PAIR(hs,ls) ( ((((hs) & 0xFC00)) == 0xD800) && ((((ls) & 0xFC00)) == 0xDC00) )
     //
     #define PTR_TESTFOR_SURROGATE_HI(ptr) (((*(ptr) & 0xFC00)) == 0xD800 )
     #define PTR_TESTFOR_SURROGATE_LO(ptr) (((*(ptr) & 0xFC00)) == 0xDC00 )
     #define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( (((*(ptr) & 0xFC00)) == 0xD800) && (((*((ptr)+1) & 0xFC00)) == 0xDC00) )
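
    For anyone who wants to try the conversions without a C compiler, here is a minimal runnable sketch of the same bit arithmetic in Python. The function names are hypothetical, and the upper bound check against 0x10FFFF is an added assumption that the pseudo-code above does not make.

```python
def is_high_surrogate(unit: int) -> bool:
    # Top 6 bits are 110110 (0xD800-0xDBFF).
    return (unit & 0xFC00) == 0xD800


def is_low_surrogate(unit: int) -> bool:
    # Top 6 bits are 110111 (0xDC00-0xDFFF).
    return (unit & 0xFC00) == 0xDC00


def surrogates_to_code_point(hi: int, lo: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point."""
    if not (is_high_surrogate(hi) and is_low_surrogate(lo)):
        raise ValueError("not a surrogate pair")
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF))


def code_point_to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into (hi, lo) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:   # assumed bound, see note above
        raise ValueError("code point has no surrogate pair")
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


hi, lo = code_point_to_surrogates(0x210C1)
print(hex(hi), hex(lo))                         # 0xd844 0xdcc1
print(hex(surrogates_to_code_point(hi, lo)))    # 0x210c1
```

    The round trip reproduces the \x{D844}\x{DCC1} pair used in the example above.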
    
  • 2021-02-12 13:52

    As minitech suggests in the first comment, you have to use the code point; for this character, it's \x{210C1}. That's also the encoded form in UTF-32. F0 A1 83 81 is the UTF-8 encoded sequence (see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1).
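
    The relationship between the code point and its encoded forms can be checked directly. The sketch below uses Python rather than PHP, so the escape is written \U000210C1 instead of PCRE's \x{210C1}; the variable names are only illustrative.

```python
import re

ch = "\U000210C1"   # the character from the question, written by code point

# One code point, three encoded forms.
print(ch.encode("utf-8").hex(" "))       # f0 a1 83 81
print(ch.encode("utf-16-be").hex(" "))   # d8 44 dc c1
print(ch.encode("utf-32-be").hex(" "))   # 00 02 10 c1

# A str pattern is matched per code point, so a class covering everything
# above 0xFFFF needs no surrogates and no UTF-8 bytes.
supplementary = re.compile("[\U00010000-\U0010FFFF]")
print(supplementary.findall("a" + ch + "b") == [ch])   # True
```

    The pattern is written against code points, and the engine deals with whatever encoding the subject string uses internally.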

    There are some versions of PCRE where you can use values up to \x{7FFFFFFF}. But I really don't know what could be matched with it.

    To quote http://www.pcre.org/pcre.txt:

    In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are used in pairs to encode values greater than 0xffff.

    [...]

    In UTF-32 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are ill-formed in UTF-32.

    From this I conclude that 0x10ffff is the largest value you can use to match a character. 0x10ffff is currently also the largest code point defined in the Unicode standard (see What are some of the differences between the UTFs?), so any value above it does not make any sense (or I just don't get it)...
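
    As a rough illustration of the range quoted from the PCRE documentation, here is a small Python check. The helper name is_valid_utf_mode_value is hypothetical and only encodes the two rules in the quote (0 to 0x10ffff, minus 0xd800 to 0xdfff); it is not PCRE's own validation code.

```python
import sys


def is_valid_utf_mode_value(value: int) -> bool:
    """True for 0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF."""
    return 0 <= value <= 0x10FFFF and not 0xD800 <= value <= 0xDFFF


print(hex(sys.maxunicode))                    # 0x10ffff
print(is_valid_utf_mode_value(0x210C1))       # True
print(is_valid_utf_mode_value(0xD844))        # False
print(is_valid_utf_mode_value(0x7FFFFFFF))    # False

# Python's own str type enforces the same upper bound.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```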

  • 2021-02-12 13:58

    "but want to know about the max hex boundary in a regex":

    * in all UTF modes: 0x10ffff
    * native 8-bit mode: 0xff
    * native 16-bit mode: 0xffff
    * native 32-bit mode: 0x1fffffff

  • 2021-02-12 14:08

    Unicode and UTF-8, UTF-16, UTF-32 encoding

    Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.

    In Unicode, a character maps to a single code point, but it can have different representation depending on how it is encoded.
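
    To make the distinction concrete, here is a small sketch that encodes a few sample characters and reports how many bytes each encoding needs. The helper encoded_sizes and the choice of sample characters are illustrative assumptions, not part of the question.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch: str) -> dict[str, int]:
    """Number of bytes needed to store one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# One code point each, from the ASCII range up to beyond 0xFFFF.
samples = ["A", "\u00e9", "\u20ac", "\U000210C1"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

    Each sample is a single code point throughout; only the stored byte count changes with the encoding.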

    I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    Using the example in the question,
