Without the u flag, the hex range you can use is [\x{00}-\x{ff}], but with the u flag it goes up to a 4-byte value, \x{7FFFFFFF}.
So I can't match a letter like the one in the question, \x{210C1}?
I'm not sure about PHP, but there really is no governor on code points,
so it doesn't matter that there are only some 1.1 million valid ones.
That is subject to change at any time, but it's not really up to engines
to enforce it. There are reserved code points that are holes in the valid range,
and there are surrogates in the valid range; there are endless reasons for there
to be no restriction other than the word size.
For UTF-32, you can't go over 31 bits because bit 32 is the sign bit, giving a range of
0x00000000 - 0x7FFFFFFF.
That makes sense when code points are held in a signed int,
the natural size of a 32-bit hardware register.
For UTF-16, the same limitation shows up masked down to 16 bits,
leaving 0x0000 - 0xFFFF
as the valid range.
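To make that distinction concrete, here is a sketch of my own (the helper name and the checks are hypothetical, not something any engine is required to do) that layers the Unicode rules on top of the raw word-size range:

/* Word-size ceiling: 31 bits if code points live in a signed 32-bit int. */
#define WORDSIZE_MAX 0x7FFFFFFF

/* Unicode scalar values: at most 0x10FFFF, excluding the surrogate hole. */
static int is_unicode_scalar(unsigned int cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

An engine that only checks cp <= WORDSIZE_MAX will happily accept \x{7FFFFFFF}; enforcing is_unicode_scalar() is a policy choice, not a requirement.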
Usually, if you use an engine that supports ICU, you should be able to use it;
ICU converts both the subject and the regex into UTF-32. Boost Regex is one such engine.
edit:
Regarding UTF-16:
I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs, but that left only 20 total payload bits between the pair as usable:
10 bits in each surrogate, with the other 6 used to mark it as hi or lo.
Those 20 bits (0x100000 values), offset to start above 0xFFFF, left the Unicode folks with a ceiling of 0x10FFFF code points, with unusable holes.
To be able to convert between the different encodings (8/16/32), all the code points
must actually be convertible. Thus the forever-backward-compatible 20-bit scheme is
the trap they ran into early, but now must live with.
Regardless, regex engines won't be enforcing this limit anytime soon, probably never.
As for surrogates, they are the hole, and a malformed literal surrogate can't be converted between modes. That just pertains to a literal encoded character during conversion, not a hex representation of one. For instance, it's easy to search a text in UTF-16 mode (only) for unpaired surrogates, or even paired ones.
But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say:
'Hey, wait, the mode is UTF-16, I'd better convert \x{210C1}
to \x{D844}\x{DCC1}.
Wait, if I did that, what do I do if it's quantified, \x{210C1}+
- start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]?
Nah, better to limit it to \x{FFFF}.'
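For the record, that \x{D844}\x{DCC1} pair falls straight out of the surrogate math described above:

0x210C1 - 0x10000 = 0x110C1
hi = 0xD800 + (0x110C1 >> 10)   = 0xD800 + 0x44 = 0xD844
lo = 0xDC00 + (0x110C1 & 0x3FF) = 0xDC00 + 0xC1 = 0xDCC1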
Some handy-dandy pseudo-code surrogate conversions I use:
Definitions:
====================
10-bit payload mask
3FF = 000000 1111111111
Hi Surrogate
D800 = 110110 0000000000
DBFF = 110110 1111111111
Lo Surrogate
DC00 = 110111 0000000000
DFFF = 110111 1111111111
Conversions:
====================
UTF-16 Surrogates to UTF-32

if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
{
    /* take 10 payload bits from each half, then add back the 0x10000 offset */
    u32Out = 0x10000 + ( ((hi & 0x3FF) << 10) | (lo & 0x3FF) );
}

UTF-32 to UTF-16 Surrogates

if ( u32In >= 0x10000 )
{
    u32In -= 0x10000;                            /* now a 20-bit value  */
    hi = (0xD800 + ((u32In & 0xFFC00) >> 10));   /* top 10 bits         */
    lo = (0xDC00 + (u32In & 0x3FF));             /* bottom 10 bits      */
}
Macros:
====================
#define TESTFOR_SURROGATE_HI(hs)      ( ((hs) & 0xFC00) == 0xD800 )
#define TESTFOR_SURROGATE_LO(ls)      ( ((ls) & 0xFC00) == 0xDC00 )
#define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )
//
#define PTR_TESTFOR_SURROGATE_HI(ptr)   ( (*(ptr) & 0xFC00) == 0xD800 )
#define PTR_TESTFOR_SURROGATE_LO(ptr)   ( (*(ptr) & 0xFC00) == 0xDC00 )
#define PTR_TESTFOR_SURROGATE_PAIR(ptr) ( ((*(ptr) & 0xFC00) == 0xD800) && ((*((ptr)+1) & 0xFC00) == 0xDC00) )
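To show the conversions and macros in action, here's a minimal, self-contained round trip of \x{210C1} (the pair macro is repeated so the snippet compiles on its own):

#include <stdio.h>

#define TESTFOR_SURROGATE_PAIR(hs,ls) ( (((hs) & 0xFC00) == 0xD800) && (((ls) & 0xFC00) == 0xDC00) )

int main(void)
{
    /* UTF-32 to UTF-16: split U+210C1 into a surrogate pair */
    unsigned int u32In = 0x210C1, hi = 0, lo = 0;
    if ( u32In >= 0x10000 )
    {
        unsigned int v = u32In - 0x10000;        /* 0x110C1 */
        hi = 0xD800 + ((v & 0xFFC00) >> 10);     /* 0xD844  */
        lo = 0xDC00 + (v & 0x3FF);               /* 0xDCC1  */
    }
    printf("hi = %04X, lo = %04X\n", hi, lo);

    /* UTF-16 to UTF-32: recombine the pair */
    if ( TESTFOR_SURROGATE_PAIR(hi, lo) )
    {
        unsigned int u32Out = 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF));
        printf("u32Out = %X\n", u32Out);         /* 0x210C1 */
    }
    return 0;
}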
As minitech suggests in the first comment, you have to use the code point - for this character, it's \x{210C1}. That's also the encoded form in UTF-32. F0 A1 83 81 is the UTF-8 encoded sequence (see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1).
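If you want to verify those bytes yourself, here is a small sketch (utf8_encode is my own hypothetical helper, not a library call) that prints the UTF-8 sequence for any code point up to 0x10FFFF:

#include <stdio.h>

/* Encode one code point (<= 0x10FFFF, non-surrogate) into UTF-8;
   returns the number of bytes written. */
static int utf8_encode(unsigned int cp, unsigned char out[4])
{
    if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800)   { out[0] = (unsigned char)(0xC0 | (cp >> 6));
                        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (unsigned char)(0xE0 | (cp >> 12));
                        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (unsigned char)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x210C1, buf);
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);                 /* F0 A1 83 81 */
    printf("\n");
    return 0;
}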
There are some versions of PCRE where you can use values up to \x{7FFFFFFF}. But I really don't know what could be matched with it.
To quote http://www.pcre.org/pcre.txt:
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are used in pairs to encode values greater than 0xffff.
[...]
In UTF-32 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are ill-formed in UTF-32.
0x10ffff is the largest value you can use to match a character (that's what I take from this). 0x10ffff is currently also the largest code point defined in the Unicode standard (see What are some of the differences between the UTFs?) - thus any value above it doesn't make sense (or I just don't get it)...
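For what it's worth, here's a minimal sketch of matching that code point from C with the classic PCRE (PCRE1) API - assuming a libpcre built with UTF-8 support:

#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void)
{
    const char *err;
    int erroff, ovec[3];

    /* \x{210C1} needs PCRE_UTF8; without it, \x{} is limited to 0xFF. */
    pcre *re = pcre_compile("^\\x{210C1}$", PCRE_UTF8, &err, &erroff, NULL);
    if (!re)
    {
        fprintf(stderr, "compile failed at %d: %s\n", erroff, err);
        return 1;
    }

    const char *subject = "\xF0\xA1\x83\x81";    /* U+210C1 as UTF-8 */
    int rc = pcre_exec(re, NULL, subject, (int)strlen(subject), 0, 0, ovec, 3);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");

    pcre_free(re);
    return 0;
}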
"but want to know about the max hex boundary in a regex": * in all utf modes: 0x10ffff * native 8-bt mode: 0xff * native 16-bit mode: 0xffff * native 32-bit mode: 0x1fffffff
Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how the Unicode code points are stored.
In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.
I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Using the example in the question,