PHP: split string on comma, but NOT when between braces or quotes?

梦毁少年i 2020-12-03 22:45

In PHP I have the following string:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";


        
2 Answers
  • 2020-12-03 23:15

    Instead of a preg_split, do a preg_match_all:

    $str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 
    
    preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);
    
    print_r($matches);
    

    will print:

    Array
    (
        [0] => Array
            (
                [0] => AAA
                [1] => BBB
                [2] => (CCC,DDD)
                [3] => 'EEE'
                [4] => 'FFF,GGG'
                [5] => ('HHH','III')
                [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
                [7] => OOO
            )
    
    )

    The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:

    1. \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses ((?R) recurses into the whole pattern, so nesting is handled)
    2. '[^']*', which matches a single-quoted string
    3. [^(),\s]+, which matches any run of characters other than '(', ')', ',' and whitespace
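
    The three alternatives can also be written with PCRE's x (extended) flag so that each one carries its own comment. This is a minimal sketch: the helper name split_top_level and the ~ delimiter are choices made here, while the pattern itself is the one from this answer.

    ```php
    <?php
    // Hypothetical helper: returns the top-level tokens of a comma-separated list.
    function split_top_level(string $str): array
    {
        // Same pattern as above, spread out with the x flag (whitespace and # comments are ignored).
        $pattern = '~
            \( (?: [^()] | (?R) )+ \)   # 1. balanced parentheses, recursing for nested pairs
          | \'[^\']*\'                  # 2. a single-quoted string
          | [^(),\s]+                   # 3. a bare token without parentheses, commas or whitespace
        ~x';

        preg_match_all($pattern, $str, $matches);

        // Index 0 holds the full matches, i.e. the tokens themselves.
        return $matches[0];
    }

    $str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
    print_r(split_top_level($str));
    ```

    For the string from the question this should yield the same eight tokens as the output listed above. Note that the pattern only extracts tokens; it does not report malformed input such as an unclosed parenthesis.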
  • 2020-12-03 23:37

    Crazy solution

    A spartan regex that tokenizes and also validates all the tokens that it extracts:

    \G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)
    

    Regex101

    Put into a PHP string literal, with delimiters:

    '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'
    

    ideone

    The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag, so that you can check the last match in group 0 (the entire match) to see whether the whole source string has been consumed.
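
    The full-consumption check described above might look like the following. This is a sketch only: the function name tokenize_strict is hypothetical, the pattern is copied from the string literal in this answer, and PREG_SET_ORDER is added here purely for convenience.

    ```php
    <?php
    // Hypothetical wrapper: returns the tokens, or null if the input was not fully consumed.
    function tokenize_strict(string $str): ?array
    {
        $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

        // With PREG_OFFSET_CAPTURE every captured entry is a [text, offset] pair.
        $count = preg_match_all($pattern, $str, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
        if (!$count) {
            return null; // no valid token at the very start (or a regex error)
        }

        // Group 0 of the last match must end exactly at the end of the input.
        [$text, $offset] = $matches[$count - 1][0];
        if ($offset + strlen($text) !== strlen($str)) {
            return null; // matching stopped early at an invalid token
        }

        // Group 1 of each match is the token itself.
        $tokens = [];
        foreach ($matches as $m) {
            $tokens[] = $m[1][0];
        }
        return $tokens;
    }

    var_dump(tokenize_strict("AAA, (BBB,'C,D'), EEE") !== null); // well-formed input
    var_dump(tokenize_strict("AAA, (BBB"));                      // unclosed bracket
    ```

    Because of \G, each match has to start where the previous one ended, so the first invalid token stops the matching and the end offset of the last match falls short of the string length.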

    Assumptions

    • Non-quoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
    • Non-quoted text may not contain (, ), ' or ,.
    • Non-quoted text must contain at least 1 character.
    • Single quoted text may not span multiple lines.
    • Single quoted text may not contain a single quote. Consequently, there is no way to include a literal ' inside a quoted token.
    • Single quoted text may be empty.
    • Bracket token contains one or more of the following as sub-tokens: non-quoted text token, single quoted text token, or another bracket token.
    • In bracket token, 2 adjacent sub-tokens are separated by exactly one ,
    • Bracket token starts with ( and ends with ).
    • Consequently, a bracket token must have balanced brackets, and empty bracket () is not allowed.
    • Input will contain one or more of: non-quoted text, single quoted text or bracket token. The tokens in the input are separated by a comma ,. A single trailing comma , is considered valid.
    • Whitespace characters (as defined by \s, which includes newline characters) are allowed freely between tokens, the commas , separating tokens, and the brackets (, ) of the bracket tokens.
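
    A few probe inputs make these assumptions concrete. This is a small sketch with a hypothetical helper is_valid_list: because \G forces the matches to be contiguous from offset 0, the input is accepted only when the lengths of all matches add up to the length of the whole string.

    ```php
    <?php
    // Hypothetical helper: true if the whole input is a valid token list for the regex above.
    function is_valid_list(string $str): bool
    {
        $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

        if (!preg_match_all($pattern, $str, $matches)) {
            return false;
        }
        // Matches are contiguous thanks to \G, so their total length tells how much was consumed.
        return strlen(implode('', $matches[0])) === strlen($str);
    }

    $probes = [
        "''",                // accepted: quoted text may be empty
        "AAA,",              // accepted: a single trailing comma is valid
        "( AAA , ( 'x' ) )", // accepted: whitespace around tokens, commas and brackets
        "()",                // rejected: empty bracket token
        "AAA BBB",           // rejected: whitespace inside non-quoted text
        "(AAA,,BBB)",        // rejected: two commas between sub-tokens
        "AAA,,",             // rejected: more than one trailing comma
        "'a\nb'",            // rejected: quoted text spanning two lines
    ];

    foreach ($probes as $probe) {
        printf("%-22s %s\n", json_encode($probe), is_valid_list($probe) ? 'accepted' : 'rejected');
    }
    ```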

    Breakdown

    \G\s*+
    (
      (
        \(
        (?:
            \s*+
            (?2)
            \s*+
            (?(?!\)),)
          |
            \s*+
            [^()',\s]++
            \s*+
            (?(?!\)),)
          |
            \s*+
            '[^'\r\n]*+'
            \s*+
            (?(?!\)),)
        )++
        \)
      )
      |
      [^()',\s]++
      |
      '[^'\r\n]*+'
    )
    \s*+(?:,|$)
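
    Whitespace outside character classes is ignored under PCRE's x flag, and none of the character classes here contains a literal space. The indented layout above can therefore be used almost verbatim as the pattern. The sketch below, with hypothetical variable names, compares it against the one-line form on the string from the question.

    ```php
    <?php
    // The breakdown above, pasted into a single-quoted literal with the x flag.
    $verbose = '/
        \G\s*+
        (
          (
            \(
            (?:
                \s*+
                (?2)
                \s*+
                (?(?!\)),)
              |
                \s*+
                [^()\',\s]++
                \s*+
                (?(?!\)),)
              |
                \s*+
                \'[^\'\r\n]*+\'
                \s*+
                (?(?!\)),)
            )++
            \)
          )
          |
          [^()\',\s]++
          |
          \'[^\'\r\n]*+\'
        )
        \s*+(?:,|$)
    /x';

    // The compact one-line form from the answer.
    $compact = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";

    preg_match_all($verbose, $str, $verboseMatches);
    preg_match_all($compact, $str, $compactMatches);

    // Group 1 holds the tokens; both forms should agree.
    var_dump($verboseMatches[1] === $compactMatches[1]);
    ```

    Keeping the verbose form in the source makes the three sub-token branches and the conditional (?(?!\)),) easier to review than the single-line literal.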
    