Regex with possible empty matches and multi-line match

猫巷女王i 2021-01-23 22:28

I've been trying to "parse" some data using a regex, and I feel as if I'm close, but I just can't seem to bring it all home.
The data that needs parsing gener

4 answers
  • 2021-01-23 22:44

    I'd avoid using a regex for this task and instead split it into sub-tasks.

    Basic algorithm outline

    1. Split the string on \n using explode.
    2. Loop over the resulting array:
      1. Split each line on ": " (colon plus space), also using explode, with a limit of 2.
      2. If the resulting array has fewer than 2 elements, the line has no separator: append the whole line to the previous key's value.
      3. Otherwise, use the first element as the key and the second as the value, unless the colon was escaped (the first element ends in a backslash); in that case, append key + separator + value to the previous key's value.

    This algorithm assumes that no key contains an escaped colon. Escaped colons in values (i.e. the user input) are handled correctly.
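
    To illustrate steps 2.1 to 2.3, here is a minimal sketch of how explode with a limit of 2 sorts each line into one of the three cases. The helper name classifyLine and the sample lines are hypothetical, chosen only for illustration; the ": " separator is assumed to match the $split used in the code below.

```php
<?php
// Hypothetical helper: reports which branch of the outline a line falls into.
function classifyLine($line, $split = ': ')
{
    // A limit of 2 keeps any later separators inside the value part.
    $parts = explode($split, $line, 2);

    if (count($parts) < 2) {
        return 'continuation'; // no separator: append to the previous key
    }
    if (substr($parts[0], -1) === '\\') {
        return 'escaped';      // the separator was preceded by a backslash
    }
    return 'pair';             // a new key/value pair
}

// Sample lines (assumed input, for illustration only).
$samples = array(
    'When: 01/02/2013 01:23:45',
    'this is nillable, but can be quite long.',
    'This\: works too.',
);

foreach ($samples as $line) {
    echo classifyLine($line), "\n";
}
```

    With the limit of 2, only the first separator splits the line, so anything after it (such as the time in the first sample) stays in the value.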

    Code

    $str = <<<EOT
    FooID: 123456
    Name: Chuck
    When: 01/02/2013 01:23:45
    InternalID: 
    User Message: Hello,
    this is nillable, but can be quite long. Text can be spread out over many lines
    This\: works too. And can start with any number of \\n's. It can be empty, too.
    What's worse, though is that this CAN contain colons (but they're _"escaped"_
    
    
    using `\`) like so `\:`, and even basic markup!
    EOT;
    
    $arr = explode("\n", $str);
    
    $prevKey = '';
    $split = ': ';
    $output = array();
    for ($i = 0, $arrlen = sizeof($arr); $i < $arrlen; $i++) {
      $keyValuePair = explode($split, $arr[$i], 2);
      // ?: Is this a valid key/value pair
      if (sizeof($keyValuePair) < 2 && $i > 0) {
        // -> Nope, append the value to the previous key's value
        $output[$prevKey] .= "\n" . $keyValuePair[0];
      }
      else {
        // -> Maybe
        // ?: Did we miss an escaped colon
        if (substr($keyValuePair[0], -1) === '\\') {
          // -> Yep, so this line is a value, not a key/value pair. Append both key and
          // value (including the split between) to the previous key's value, ignoring
          // any colons in the rest of the string (allowing dates to pass through)
          $output[$prevKey] .= "\n" . $keyValuePair[0] . $split . $keyValuePair[1];
        }
        else {
          // -> Nope, create a new key with a value
          $output[$keyValuePair[0]] = $keyValuePair[1];
          $prevKey = $keyValuePair[0];
        }
      }
    }
    
    var_dump($output);
    

    Output

    array(5) {
      ["FooID"]=>
      string(6) "123456"
      ["Name"]=>
      string(5) "Chuck"
      ["When"]=>
      string(19) "01/02/2013 01:23:45"
      ["InternalID"]=>
      string(0) ""
      ["User Message"]=>
      string(293) "Hello,
    this is nillable, but can be quite long. Text can be spread out over many lines
    This\: works too. And can start with any number of \n's. It can be empty, too.
    What's worse, though is that this CAN contain colons (but they're _"escaped"_
    
    
    using `\`) like so `\:`, and even basic markup!"
    }
    

    Online demo

  • 2021-01-23 22:51

    The following regex should work, but I'm not so sure anymore if it is the right tool for this:

    preg_match_all(
        '%^            # Start of line
        ([^:]*)        # Match anything until a colon, capture in group 1
        :\s*           # Match a colon plus optional whitespace
        (              # Match and capture in group 2:
         (?:           # Start of non-capturing group (used for alternation)
          .*$          #  Either match the rest of the line
          (?=          #  only if one of the following follows here:
           \Z          #  The end of the string
          |            #  or
           \r?\n       #  a newline
           [^:\n\\\\]* #  followed by anything except colon, backslash or newline
           :           #  then a colon
          )            #  End of lookahead
         |             # or match
          (?:          #  Start of non-capturing group (used for alternation/repetition)
           [^:\\\\]    #  Either match a character except colon or backslash
          |            #  or
           \\\\.       #  match any escaped character
          )*           #  Repeat as needed (end of inner non-capturing group)
         )             # End of outer non-capturing group
        )              # End of capturing group 2
        $              # Match the end of the line%mx', 
        $subject, $result, PREG_PATTERN_ORDER);
    

    See it live on regex101.

  • 2021-01-23 22:55

    So here's what I came up with using a tricky preg_replace_callback():

    $string ='FooID: 123456
    Name: Chuck
    When: 01/02/2013 01:23:45
    InternalID: 789654
    User Message: Hello,
    this is nillable, but can be quite long. Text can be spread out over many lines
    And can start with any number of \n\'s. It can be empty, too
    Yellow:cool';
    
    $array = array();
    preg_replace_callback('#^(.*?):(.*)|.*$#m', function($m)use(&$array){
        static $last_key = ''; // Remembers the most recent key between callback calls
        if(isset($m[1])){// If there is a normal match (key : value)
            $array[$m[1]] = $m[2]; // Then add to array
            $last_key = $m[1]; // define the new last key
        }else{ // else
            $array[$last_key] .= PHP_EOL . $m[0]; // add the whole line to the last entry
        }
    }, $string); // Anonymous function used thus PHP 5.3+ is required
    print_r($array); // print
    

    Online demo

    Downside: I'm using PHP_EOL to add newlines, which is OS-dependent.
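
    If that OS dependence matters, one option is to normalise the input's line endings up front and append a literal "\n" instead of PHP_EOL. A minimal sketch, assuming a hypothetical helper named normalizeNewlines (not part of the code above):

```php
<?php
// Hypothetical helper: converts Windows (CRLF) and old Mac (CR) line endings to LF.
function normalizeNewlines($text)
{
    // str_replace applies the searches in order, so "\r\n" is handled
    // before any remaining lone "\r".
    return str_replace(array("\r\n", "\r"), "\n", $text);
}

// Assumed sample input with mixed line endings.
$raw = "Name: Chuck\r\nUser Message: Hello,\rthis spans lines\n";
$normalized = normalizeNewlines($raw);

// Every line ending is now a single LF, so splitting on "\n" is predictable.
print_r(explode("\n", $normalized));
```

    After this step the callback could append "\n" . $m[0], so the stored values would no longer depend on the platform running the script.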

  • 2021-01-23 23:01

    I'm pretty new to PHP, so this may be totally out of whack, but maybe you could use something like this:

    $data = <<<EOT
    FooID: 123456
    Name: Chuck
    When: 01/02/2013 01:23:45
    InternalID: 789654
    User Message: Hello,
    this is nillable, but can be quite long. Text can be spread out over many     lines
    And can start with any number of \n's. It can be empty, too
    EOT;
    
    if ($key = preg_match_all('~^[^:\n]+?:~m', $data, $match)) {
        $val = explode('¬', preg_filter('~^[^:\n]+?:~m', '¬', $data));
    
        array_shift($val);
    
        $res = array_combine($match[0], $val);
    }
    
    print_r($res);
    

    yields

    Array
    (
        [FooID:] =>  123456
        [Name:] =>  Chuck
        [When:] =>  01/02/2013 01:23:45
        [InternalID:] =>  789654
        [User Message:] =>  Hello,
    this is nillable, but can be quite long. Text can be spread out over many     lines
    And can start with any number of 
    's. It can be empty, too
    )
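
    In the output above, the keys keep their trailing colon and the values keep their leading space. If that is unwanted, a follow-up pass could trim both. A minimal sketch, assuming a hypothetical helper named cleanPairs and a shortened sample array:

```php
<?php
// Hypothetical helper: strips the trailing ':' from keys and
// surrounding whitespace from values.
function cleanPairs(array $pairs)
{
    $clean = array();
    foreach ($pairs as $key => $value) {
        // Note: trim() also removes a trailing newline from multi-line values.
        $clean[rtrim($key, ':')] = trim($value);
    }
    return $clean;
}

// Shortened sample in the same shape as the result above (assumed data).
$res = array(
    'FooID:'        => ' 123456',
    'User Message:' => " Hello,\nthis is nillable\n",
);

print_r(cleanPairs($res));
```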
    