I have a PHP regex, how do I add a condition for the number of characters?

臣服心动 2021-01-27 02:00

I have a regular expression that I'm using in PHP:

$word_array = preg_split(
    '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|o


        
5 Answers
  •  野的像风
    2021-01-27 02:06

    The magic of the split. My original assumption was technically not correct (albeit a solution easier to come to). So let's check your split pattern:

    (\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
    

    I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:

     html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
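As a quick sanity check (a sketch, not part of the original answer), the two patterns can be compared on a sample path. The sample path and the variable names are assumptions chosen for illustration only.

```php
<?php
// Pattern as quoted at the start of this answer, and the rearranged version.
$original   = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

// Sample input, assumed for illustration.
$subject = '/2009/06/pagerank-update.html';

// The capture group has no effect on the result because
// PREG_SPLIT_DELIM_CAPTURE is not set.
$a = preg_split($original, $subject, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, $subject, -1, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b); // bool(true)
print_r($b);         // 2009, 06, pagerank, update
```

Both variants split at the same places here, because the single-character delimiters and the word tokens never start with the same character, so their relative order in the alternation does not matter.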
    

    That is just some tidying up front. Let's call this the split pattern, or s for short, and define it as a named subpattern.

    You want to match every part that contains none of the split pattern's tokens and is at least three characters long.

    I could achieve this with the following pattern, which handles the split sequences correctly and supports Unicode.

    $pattern    = '/
        (?(DEFINE)
            (?<s> # define subpattern "s", which is the split pattern
                html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
                [\\/._=?&%+-] # a little bit optimized with a character class
            )
        )
        (?:(?&s))          # consume the subpattern (URL starts with \/)
        \K                 # capture starts here
        (?:(?!(?&s)).){3,} # ensure this is not the split pattern, take 3 characters minimum
    /ux';
    

    Or, more compactly:

    $path       = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
    $subject    = urldecode($path);
    $pattern    = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
    $word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
    print_r($word_array);
    

    Result:

    Array
    (
        [0] => 2009
        [1] => pagerank
        [2] => update
        [3] => test
        [4] => testä
    )
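One consequence of this pattern is worth illustrating: because it first consumes a split token before `\K`, a word is only found when something from the split pattern precedes it. The sketch below uses two made-up inputs (assumptions, not from the question) to show the difference.

```php
<?php
// Same compact pattern as above.
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';

// Hypothetical inputs: one with a leading slash, one without.
$withSlash    = preg_match_all($pattern, '/pagerank-update', $m) ? $m[0] : [];
$withoutSlash = preg_match_all($pattern, 'pagerank-update', $m) ? $m[0] : [];

print_r($withSlash);    // pagerank, update
print_r($withoutSlash); // update
```

For URL paths this is usually harmless, since they start with `/`. For input that may start with a word, prepending a delimiter to the subject before matching would be one possible workaround.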
    

    The same principle can be used with preg_split as well. It's a little bit different:

    $pattern = '/
        (?(DEFINE)       # define subpattern which is the split pattern
            (?<s>
                html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
                [\/._=?&%+-]
            )
        )
        (?:(?!(?&s)).){3,}(*SKIP)(*FAIL)       # three or more characters: keep, do not split here
        |(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)   # one or two characters: consume as a delimiter
        |(?&s)                                 # otherwise split at the split pattern
    /ux';
    

    Usage:

    $word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
    

    Result:

    Array
    (
        [0] => 2009
        [1] => pagerank
        [2] => update
        [3] => test
        [4] => testä
    )
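Since the question is about a condition on the number of characters, the same preg_split pattern can be built for an arbitrary minimum length. This is only a sketch: the function name `split_min_length` and its parameters are hypothetical, and the pattern is assembled by concatenation because it contains a literal `%`, which would clash with sprintf.

```php
<?php
// Hypothetical helper, not from the original answer: builds the preg_split
// pattern for a configurable minimum length instead of the hard-coded 3.
function split_min_length(string $subject, int $min): array
{
    if ($min < 2) {
        // The {1,min-1} quantifier below needs min - 1 >= 1.
        throw new InvalidArgumentException('$min must be at least 2');
    }

    $s = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]';

    $pattern = '/(?(DEFINE)(?<s>' . $s . '))'
        . '(?:(?!(?&s)).){' . $min . ',}(*SKIP)(*FAIL)'            // long enough: keep
        . '|(?:(?!(?&s)).){1,' . ($min - 1) . '}(*SKIP)(*ACCEPT)'  // too short: treat as delimiter
        . '|(?&s)/u';                                               // regular split token

    $parts = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

print_r(split_min_length('/2009/06/pagerank-update.html', 3)); // 2009, pagerank, update
print_r(split_min_length('/2009/06/pagerank-update.html', 5)); // pagerank, update
```

With a minimum of 5, the four-character part `2009` is swallowed as a delimiter just like `06`.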
    

    These routines do what was asked for, but they come at a price in performance. The cost is similar to that of the old answer.

    Related questions:

    • Antimatch with Regex
    • Split string by delimiter, but not if it is escaped

    Old answer, using two-step processing (first splitting, then filtering)

    Because you are using a split routine, it will split regardless of the length of the resulting parts.

    So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:

    $word_array = preg_filter(
        '/^.{3,}$/', '$0', 
        preg_split(
            '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
            urldecode($path), 
            -1, // no limit
            PREG_SPLIT_NO_EMPTY
        )
    );
    

    Result:

    Array
    (
        [0] => 2009
        [2] => pagerank
        [3] => update
    )
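Note that the keys in this result are not consecutive, because filtering keeps the original keys. A variant of the two-step approach, sketched below, reindexes the result and counts characters rather than bytes by adding the `u` modifier to the length check. The function name `split_and_filter` is hypothetical, and the closure syntax assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: split first, then keep only parts with at least
// $min characters, and reindex the result.
function split_and_filter(string $path, int $min = 3): array
{
    $parts = preg_split(
        '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/',
        urldecode($path),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if ($parts === false) {
        return [];
    }

    // With the u modifier "." matches one UTF-8 character, not one byte;
    // s lets it match newlines as well.
    $lengthCheck = '/^.{' . $min . ',}$/us';
    $long = array_filter(
        $parts,
        fn(string $part): bool => preg_match($lengthCheck, $part) === 1
    );

    return array_values($long); // consecutive keys 0, 1, 2, ...
}

print_r(split_and_filter('/2009/06/pagerank-update.html')); // 2009, pagerank, update
print_r(split_and_filter('/ab/%C3%A4%C3%A4/abc'));          // abc
```

In the second call the decoded part `ää` is four bytes but only two characters, so it is dropped. A byte-based check such as `/^.{3,}$/` without `u` would have kept it.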
    
