I have a regular expression that I'm using in PHP:
$word_array = preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|o
The magic of the split. My original assumption was technically not correct (albeit a solution easier to come to). So let's check your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I re-arranged it a bit. The outer parentheses are not necessary, so I dropped them, and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
That is just some sorting up front. Let's call this pattern the split pattern, s for short, and define it.
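As a quick sanity check that the re-arrangement does not change what gets split, here is a minimal sketch that runs both forms against a sample path. The sample path is my own assumption, picked only for illustration:

```php
<?php
// Original alternation from the question and the re-arranged form.
$original   = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

// Sample path, assumed for illustration only.
$subject = '/2009/06/pagerank-update.html';

$a = preg_split($original, $subject, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, $subject, -1, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b); // bool(true)
print_r($b);         // 2009, 06, pagerank, update
```

The order of the word alternatives is unchanged, and the words and the single characters can never match at the same position, so moving the characters to the end should be safe.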
You want to match all parts that contain nothing from the split pattern and that are at least three characters long.
I could achieve this with the following pattern, which handles the split sequences correctly and supports Unicode.
$pattern = '/
    (?(DEFINE)
        (?<s>  # define subpattern which is the split pattern
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\\/._=?&%+-]  # a little bit optimized with a character class
        )
    )
    (?:(?&s))           # consume the subpattern (URL starts with \/)
    \K                  # capture starts here
    (?:(?!(?&s)).){3,}  # ensure this is not the split pattern, take 3 characters minimum
/ux';
Or, more compactly:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
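If the minimum length should be adjustable, the same pattern can be assembled from parts. This is only a sketch: the function name extract_words and its $minLength parameter are hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration):
// the same match-all pattern, with the minimum length passed in.
function extract_words(string $path, int $minLength = 3): array
{
    // The split pattern "s" from above.
    $split   = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]';
    $pattern = '/(?(DEFINE)(?<s>' . $split . '))'
             . '(?:(?&s))\K(?:(?!(?&s)).){' . max(1, $minLength) . ',}/u';

    return preg_match_all($pattern, urldecode($path), $m) ? $m[0] : [];
}

print_r(extract_words('/2009/06/pagerank-update.html'));    // 2009, pagerank, update
print_r(extract_words('/2009/06/pagerank-update.html', 2)); // 2009, 06, pagerank, update
```

Keep in mind that every match has to be preceded by something from the split pattern, so a word at the very start of a subject that does not begin with a separator would be missed. A URL path starts with /, so that is fine here.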
The same principle can be used with preg_split as well. It's a little bit different:
$pattern = '/
    (?(DEFINE)  # define subpattern which is the split pattern
        (?<s>
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\/._=?&%+-]
        )
    )
    (?:(?!(?&s)).){3,}(*SKIP)(*FAIL)      # three or more is okay: keep it, do not split here
    |(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)  # two or one is none: treat it as a delimiter
    |(?&s)                                # split @ split, at least
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
Both routines do what was asked for, but they come at a price in performance, because the negative lookahead runs at every character position. The cost is similar to that of the old answer.
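If you want to check the cost on your own data, a rough timing sketch could look like the following. The helper time_calls and the iteration count are my own assumptions, the single-pass pattern is the one from above written without the x modifier, and the numbers will vary by machine and PCRE version, so I'm not quoting any.

```php
<?php
$subject = urldecode('/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml');

// The split pattern "s", shared by both variants.
$s = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]';

// Single-pass split pattern.
$single = '/(?(DEFINE)(?<s>' . $s . '))'
        . '(?:(?!(?&s)).){3,}(*SKIP)(*FAIL)'
        . '|(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)'
        . '|(?&s)/u';

// Two-step variant: plain split, then a length filter.
$plain = '/' . $s . '/u';

$onePass = function () use ($single, $subject) {
    return preg_split($single, $subject, -1, PREG_SPLIT_NO_EMPTY);
};
$twoStep = function () use ($plain, $subject) {
    $parts = preg_split($plain, $subject, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(preg_grep('/^.{3,}$/u', $parts));
};

// Hypothetical timing helper: total seconds spent on $n calls of $fn.
function time_calls(callable $fn, int $n = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $n; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

var_dump($onePass() === $twoStep()); // bool(true): both give the same words
printf("one pass: %.4fs, two steps: %.4fs\n", time_calls($onePass), time_calls($twoStep));
```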
Old answer, doing two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split regardless of the length of the pieces.
So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:
$word_array = preg_filter(
    '/^.{3,}$/', '$0', // keep only entries of three or more characters
    preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path),
        -1, // no limit
        PREG_SPLIT_NO_EMPTY
    )
);
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
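Note the gap in the keys: the filter keeps the original indexes, so the dropped "06" leaves index 1 unused. Also, without the u modifier the dot counts bytes rather than characters. A minimal variation that addresses both, assuming the same sample path and using preg_grep in place of preg_filter, might look like this:

```php
<?php
// Sample path, assumed for illustration only.
$path = '/2009/06/pagerank-update.html';

$parts = preg_split(
    '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
    urldecode($path),
    -1,
    PREG_SPLIT_NO_EMPTY
);

// Keep entries of three or more characters (/u counts code points, not bytes),
// then re-index so the keys run 0, 1, 2, ...
$word_array = array_values(preg_grep('/^.{3,}$/u', $parts));

print_r($word_array); // [0] => 2009, [1] => pagerank, [2] => update
```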