I have a PHP regex. How do I add a condition for the number of characters?


I have a regular expression that I'm using in PHP:

$word_array = preg_split(
    \'/(\\/|\\.|-|_|=|\\?|\\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|o


                      
5 answers

野的像风
2021-01-27 02:06
              
            
            
                                                                       
The magic of the split. My original assumption was not technically correct (although it led to an easier solution). So let's check your split pattern:

(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)


I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:

 html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
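
As a quick sanity check, the two forms can be compared on a sample path. This is only a sketch: the sample path is the one used elsewhere in this thread, and the variable names are made up for illustration.

```php
<?php
// Sample input (taken from the examples in this thread).
$subject = '/2009/06/pagerank-update.html';

// The original split pattern and the rearranged one with a character class.
$original   = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

// Split with both patterns, dropping empty pieces.
$a = preg_split($original, $subject, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, $subject, -1, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b); // both patterns give the same pieces for this input
print_r($b);         // 2009, 06, pagerank, update
```

Note that the short piece "06" is still present here; the split itself does not care about length.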


So much for tidying up front. Let's call this pattern the split pattern, s for short, and define it as a named subpattern.

You want to match every part that contains nothing from the split pattern and is at least three characters long.

I could achieve this with the following pattern, which handles the split sequences correctly and supports Unicode.

$pattern    = '/
    (?(DEFINE)
        (?<s> # define subpattern which is the split pattern
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\\/._=?&%+-] # a little bit optimized with a character class
        )
    )
    (?:(?&s))          # consume the subpattern (URL starts with \/)
    \K                 # reported match starts here
    (?:(?!(?&s)).){3,} # ensure this is not the split pattern, take 3 characters minimum
/ux';


Or, more compactly:

$path       = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject    = urldecode($path);
$pattern    = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);


Result:

Array
(
    [0] => 2009
    [1] => pagerank
    [2] => update
    [3] => test
    [4] => testä
)
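
If the minimum length should be configurable, the quantifier can be built from a parameter. The following is a minimal sketch under that assumption; the function name extract_words and its signature are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: same pattern as above, with the minimum length
// inserted into the {n,} quantifier.
function extract_words(string $path, int $minLength = 3): array
{
    $minLength = max(1, $minLength); // guard against zero or negative lengths

    $pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))'
             . '(?:(?&s))\K(?:(?!(?&s)).){' . $minLength . ',}/u';

    // preg_match_all returns the number of matches (0 or false means none or error).
    return preg_match_all($pattern, urldecode($path), $m) ? $m[0] : [];
}

print_r(extract_words('/2009/06/pagerank-update.html'));    // 2009, pagerank, update
print_r(extract_words('/2009/06/pagerank-update.html', 5)); // pagerank, update
```

String concatenation is used instead of sprintf because the pattern itself contains a literal % character.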


The same principle can be used with preg_split as well. It's a little bit different:

$pattern = '/
    (?(DEFINE)       # define subpattern which is the split pattern
        (?<s>
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\/._=?&%+-]
        )
    )
    (?:(?!(?&s)).){3,}(*SKIP)(*FAIL)       # three or more characters: keep, do not split here
    |(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)   # one or two characters: treat as a delimiter
    |(?&s)                                 # otherwise split at the split pattern
/ux';


Usage:

$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);


Result:

Array
(
    [0] => 2009
    [1] => pagerank
    [2] => update
    [3] => test
    [4] => testä
)


These routines do what was asked for, but they come at a performance cost. The cost is similar to that of the old answer.

Related questions:


Antimatch with Regex
Split string by delimiter, but not if it is escaped





Old answer, using two-step processing (first splitting, then filtering)


Because you are using a split routine, it will split - regardless of the length.

So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:

$word_array = preg_filter(
    '/^.{3,}$/', '$0', 
    preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path), 
        -1,
        PREG_SPLIT_NO_EMPTY
    )
);


Result:

Array
(
    [0] => 2009
    [2] => pagerank
    [3] => update
)
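
Two details of the filtering step are worth seeing in isolation: the keys are preserved (0, 2, 3 above), and without the u modifier the dot counts bytes rather than characters. The sketch below uses preg_grep instead of preg_filter purely for illustration; the sample array is made up.

```php
<?php
// Made-up sample pieces; the last one is two characters but three bytes in UTF-8.
$parts = ['2009', '06', 'pagerank', 'update', "t\u{00E4}"];

// preg_grep keeps the elements that match. With /u the dot matches a whole
// UTF-8 character, so the last element does not pass the three-character test.
$filtered = preg_grep('/^.{3,}$/u', $parts);

// preg_grep preserves the original keys; array_values renumbers them from 0.
$long = array_values($filtered);

print_r($long); // 2009, pagerank, update
```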

                                                                        
                                                        
            
            
              
                
臣服心动
2021-01-27 02:20
              
            
            
                                                                       
I would think that if you are trying to derive meaning from the URLs, you would want to write clean URLs in such a way that you don't need a complex regex to derive the values.

In many cases this involves using server redirect rules and a front controller or request router.

So what you build are clean URLs like

/value1/value2/value3


Without any .html, .php, etc. in the URL at all.

It seems to me that you are not addressing the problem adequately at the point of entry into the system (i.e. the web server), which is what would make your URL parsing as simple as it should be.
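
With clean URLs of that shape, a front controller needs little more than a split on slashes. A minimal sketch, assuming the web server already rewrites every request to one PHP entry script; the function name path_segments is hypothetical.

```php
<?php
// Hypothetical helper: turn a clean request URI into its path segments.
function path_segments(string $requestUri): array
{
    // Keep only the path component, ignoring any query string.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return []; // no path component, or a malformed URI
    }

    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment !== '') {
            // Decode each segment after splitting, so an encoded slash
            // inside a value cannot create an extra segment.
            $segments[] = rawurldecode($segment);
        }
    }

    return $segments;
}

print_r(path_segments('/value1/value2/value3?page=2')); // value1, value2, value3
```

In a real front controller the input would typically come from $_SERVER['REQUEST_URI'].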
                                                                        
                                                        
            
            
              
                
半阙折子戏
2021-01-27 02:22
              
            
            
                                                                       
I'm guessing you're building a URL router of some kind.

Detecting which parameters are useful and which are not should not be part of this code. It may vary per page whether a short parameter is relevant.

In this case, couldn't you just ignore the element at index 1? Your page (or 'handler') should know which parameters it wants to be called with, so it should do the triage.
                                                                        
                                                        
            
            
              
                
無奈伤痛
2021-01-27 02:25
              
            
            
                                                                       
Don't use a regex to break apart that path.  Just use explode.

$dirs = explode( '/', urldecode($path) );


Then, if you need to break apart an individual element of the array, do that, like on your "pagerank-update" element at the end.

EDIT:

The key is that you have two different problems.  First you want to break apart the path elements on slashes.  Then, you want to break up the filename into smaller parts.  Don't try to cram everything into one regex that tries to do everything.  

Three discrete steps:


1. $dirs = explode...
2. Weed out arguments < 3 chars
3. Break up file argument at the end


It is far clearer if you break up your logic into discrete logical chunks rather than trying to make the regex do everything.
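
The three steps could look roughly like this. It is only a sketch under a few assumptions: the function name split_path_words is hypothetical, the file name is assumed to be the last path element and to use "-" as its word separator, and the file name is set aside before filtering so that the length check cannot drop it.

```php
<?php
// Hypothetical helper illustrating the three discrete steps.
function split_path_words(string $path, int $minLength = 3): array
{
    // Step 1: break the path apart on slashes.
    $dirs = explode('/', urldecode($path));
    $file = array_pop($dirs); // the last element is treated as the file name

    // Step 2: weed out arguments shorter than $minLength (strlen counts bytes).
    $isLongEnough = fn (string $s): bool => strlen($s) >= $minLength;
    $dirs = array_filter($dirs, $isLongEnough);

    // Step 3: break up the file name, without its extension, on "-".
    $name      = pathinfo($file, PATHINFO_FILENAME);
    $fileWords = array_filter(explode('-', $name), $isLongEnough);

    // array_merge renumbers the numeric keys from 0.
    return array_merge($dirs, $fileWords);
}

print_r(split_path_words('/2009/06/pagerank-update.html')); // 2009, pagerank, update
```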
                                                                        
                                                        
            
            
              
                
耶瑟儿～
2021-01-27 02:26
              
            
            
                                                                       
How about trying preg_match_all() instead of preg_split()?

The pattern (using a negative lookbehind assertion):

/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu


The function call:

$pattern = '/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu';
$subject = '/2009/06/pagerank-update.html';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
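
Since the pattern has one capturing group, the words themselves end up in $matches[1]. A short sketch of pulling them out, using the same pattern with the duplicated htm alternative removed; the variable name $words is made up for illustration.

```php
<?php
// Same pattern as above, minus the duplicated "htm" alternative.
$pattern = '/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|aspx|asp|index|com|net|org)/iu';
$subject = '/2009/06/pagerank-update.html';

preg_match_all($pattern, $subject, $matches);

// Index 0 holds the full matches, index 1 the first capturing group.
$words = $matches[1];
print_r($words); // 2009, pagerank, update
```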


You can try the function here: functions-online.com/preg_match_all.html

Hope this helps
                                                                        
                                                        
            
            
              
                