In PHP I have the following string:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
Instead of a preg_split, do a preg_match_all:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);
print_r($matches);
will print:
Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)
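For reuse, the call can be wrapped in a small helper. This is only a sketch: the function name split_top_level is hypothetical (not from the answer), and the pattern is copied unchanged from the snippet above.

```php
<?php
// Hypothetical helper: returns the top-level tokens of a comma-separated
// string, keeping parenthesized groups and quoted strings intact.
function split_top_level(string $input): array
{
    // Same pattern as in the answer: balanced parentheses | quoted string | bare word.
    $pattern = "/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/";
    preg_match_all($pattern, $input, $matches);
    return $matches[0]; // index 0 holds the full matches
}

print_r(split_top_level("a, (b, (c, d)), 'e,f'"));
```

Note that this pattern only extracts tokens and does not validate the input. An unbalanced string such as "(a, b" is not rejected; the helper simply returns a and b as two tokens.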
The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:
\((?:[^()]|(?R))+\), which matches balanced pairs of parentheses
'[^']*', which matches a quoted string
[^(),\s]+, which matches any character sequence not containing '(', ')', ',' or white-space characters

A spartan regex that tokenizes and also validates all the tokens that it extracts:
\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)
Regex101
Put it in a string literal, with delimiters:
'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'
ideone
The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag, so that you can check against the last match in group 0 (entire match) whether the entire source string has been consumed or not.
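The "has the whole string been consumed" check can be sketched roughly as follows. The helper name tokenize_strict is hypothetical, and treating an input with no matches as invalid is an assumption of this sketch. The pattern is the string literal given above.

```php
<?php
// Hypothetical helper: returns the list of tokens (capturing group 1),
// or null when the matches do not cover the whole input.
function tokenize_strict(string $input): ?array
{
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    // With PREG_OFFSET_CAPTURE every entry is a [text, offset] pair.
    $count = preg_match_all($pattern, $input, $m, PREG_OFFSET_CAPTURE);
    if ($count === false || $count === 0) {
        return null; // regex error or nothing matched (assumed invalid here)
    }

    // The last full match (group 0) must end exactly at the end of the input.
    [$text, $offset] = $m[0][$count - 1];
    if ($offset + strlen($text) !== strlen($input)) {
        return null;
    }

    return array_column($m[1], 0); // keep only the token text of group 1
}

var_dump(tokenize_strict("AAA, (B,C), 'D,E'")); // three tokens
var_dump(tokenize_strict("AAA, (B,C"));         // NULL: unbalanced bracket
```

Because of \G, matching stops at the first position where no valid token follows, so comparing the end of the last match with the string length is enough to detect invalid input.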
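The regex enforces the following rules:
Unquoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
Unquoted text may not contain (, ), ' or ,.
Single-quoted text may not contain '.
Inside a bracket token, adjacent sub-tokens are separated by exactly one ,.
A bracket token starts with ( and ends with ).
An empty bracket token () is not allowed.
Tokens in the input are separated by ,. Single trailing comma , is considered valid.
Whitespace characters (as defined by \s, which includes new line character) are arbitrarily allowed between token(s), comma(s) , separating tokens, and the bracket(s) (, ) of the bracket tokens.

The regex, spaced out for readability:
\G\s*+ ( ( \( (?: \s*+ (?2) \s*+ (?(?!\)),) | \s*+ [^()',\s]++ \s*+ (?(?!\)),) | \s*+ '[^'\r\n]*+' \s*+ (?(?!\)),) )++ \) ) | [^()',\s]++ | '[^'\r\n]*+' ) \s*+(?:,|$)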
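The spaced-out form can also be used directly in PHP with the x (extended) modifier, which ignores whitespace in the pattern and allows # comments. The sketch below stores it in a nowdoc so no quote escaping is needed. The ~ delimiter and the comments are my own annotations and reading of the regex, not the author's wording.

```php
<?php
// Free-spacing version of the validating pattern (x modifier).
$pattern = <<<'RX'
~
\G \s*+                                   # continue where the previous match ended
(                                         # group 1: one top-level token
  (                                       # group 2: a bracket token
    \(
    (?:
        \s*+ (?2)          \s*+ (?(?!\)),)   # nested bracket token, via recursion
      | \s*+ [^()',\s]++   \s*+ (?(?!\)),)   # unquoted text
      | \s*+ '[^'\r\n]*+'  \s*+ (?(?!\)),)   # single-quoted text
    )++                                   # one or more sub-tokens
    \)
  )
  | [^()',\s]++                           # unquoted text
  | '[^'\r\n]*+'                          # single-quoted text
)
\s*+ (?: , | $ )                          # separating comma or end of input
~x
RX;

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
preg_match_all($pattern, $str, $m);
print_r($m[1]); // tokens are in capturing group 1
```

The conditional (?(?!\)),) after each sub-token reads as: if the next character is not ), a comma must follow. That is what rules out a missing or doubled comma inside brackets.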