I need to parse a search string for keywords and phrases in PHP. For example:
string 1: value of "measured response" detect goal "method valuation" study
There is no need to use a regular expression. The built-in function str_getcsv
can split a string using any given delimiter, enclosure, and escape character.
It really is as simple as:
// where $string is the string to parse
$array = str_getcsv($string, ' ', '"');
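For reference, here is a minimal sketch of that call run against the sample string from the question. The variable name $terms and the filtering step are my own additions, not part of the answer: I am assuming real search input may contain repeated spaces, which str_getcsv would report as empty fields. The escape character is passed explicitly because, as far as I know, newer PHP versions (8.4 and later) deprecate relying on its default.

```php
<?php
// Sample input taken from the question.
$string = 'value of "measured response" detect goal "method valuation" study';

// Delimiter: space, enclosure: double quote, escape: backslash.
$terms = str_getcsv($string, ' ', '"', '\\');

// Assumption: drop empty fields that repeated spaces would produce.
$terms = array_values(array_filter($terms, static function ($term) {
    return $term !== null && $term !== '';
}));

print_r($terms);
```

Each quoted phrase comes back as a single element with the quotes removed, and every unquoted word is its own element.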
// where $subject is the string to parse
preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
foreach ($result[0] as $match) {
    // $match is one keyword, or one quoted phrase without its quotes
}
This should yield the results you are looking for.
Explanation:
# (?<!")\b\w+\b|(?<=")\b[^"]+
#
# Match either the regular expression below (attempting the next alternative only if this one fails) «(?<!")\b\w+\b»
#    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
#    Assert position at a word boundary «\b»
# Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «(?<=")\b[^"]+»
#    Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match any character that is NOT a “"” «[^"]+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
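To see the two alternatives at work, here is a small runnable sketch that applies the same pattern to the sample string from the question. The variable $terms is just an illustrative name. The first alternative picks up bare words that do not directly follow a quote, and the second picks up everything after an opening quote up to the next quote.

```php
<?php
// Sample input taken from the question.
$subject = 'value of "measured response" detect goal "method valuation" study';

// Alternative 1: a whole word not directly preceded by a double quote.
// Alternative 2: text that starts right after a double quote and runs
// up to (but not including) the next double quote.
preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);

// $result[0] holds the full text of every match, in order of appearance.
$terms = $result[0];

print_r($terms);
```

Because both alternatives rely on word boundaries, I would expect a phrase that begins with punctuation to be split into separate words instead of being kept whole, so this is best treated as a sketch for simple input.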
$s = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $matches);
print_r($matches[1]);
output:
Array
(
[0] => value
[1] => of
[2] => measured response
[3] => detect
[4] => goal
[5] => method valuation
[6] => study
)
The trick here is to use a branch-reset group: (?|...|...). It's just like an alternation contained in a non-capturing group, (?:...|...), except that within each branch the capturing-group numbers start at the same number. (For more info, see the PCRE docs and search for DUPLICATE SUBPATTERN NUMBERS.)
Thus, the text we're interested in is always captured in group #1. You can retrieve the contents of group #1 for all matches via $matches[1]. (That's assuming the PREG_PATTERN_ORDER flag is set; I didn't specify it like @FailedDev did because it's the default. See the PHP docs for details.)
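To make the group-numbering difference concrete, here is a short comparison sketch. The shortened input string and the variable names $plain and $reset are mine, chosen only for illustration. It runs the same two branches once inside an ordinary non-capturing group and once inside a branch-reset group.

```php
<?php
// Shortened input: one bare word followed by one quoted phrase.
$s = 'goal "method valuation"';

// Ordinary non-capturing group: each branch gets its own group number,
// so phrases land in group 1 and bare words land in group 2.
preg_match_all('~(?:"([^"]+)"|(\S+))~', $s, $plain);

// Branch-reset group: both branches share group number 1.
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $reset);

print_r($plain[1]); // phrases only, with an empty slot for the bare word
print_r($plain[2]); // bare words only
print_r($reset[1]); // every term, in order of appearance
```

With the ordinary group you would have to merge two arrays and skip the empty slots. With the branch-reset group, $reset[1] already holds every term.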