I've been trying to "parse" some data using a regex, and I feel as if I'm close, but I just can't seem to bring it all home.
The data that needs parsing gener
I think I'd avoid using a regex for this task and instead split it into sub-tasks: first split the string on \n using explode, then split each line on : (also using explode) with a limit of 2.
This algorithm does assume there are no keys with escaped colons. Escaped colons in values (i.e. user input) will be dealt with just fine.
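To make the role of the limit concrete, here is a minimal sketch (the sample line is only an illustration, taken from the data format in the question). Without a limit, a value that itself contains colons, such as a timestamp, gets chopped into extra pieces; with a limit of 2, everything after the first separator stays together.

```php
<?php
// Illustrative sample line: the value contains colons of its own
$line = 'When: 01/02/2013 01:23:45';

// No limit: every colon splits, so the time is broken apart
$all = explode(':', $line);
// array('When', ' 01/02/2013 01', '23', '45')

// Limit of 2: only the first colon splits, the rest stays in the value
$pair = explode(':', $line, 2);
// array('When', ' 01/02/2013 01:23:45')

print_r($all);
print_r($pair);
```

Note the leading space left in the value; splitting on ': ' instead of ':' avoids it.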
$str = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID:
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \\n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_
using `\`) like so `\:`, and even basic markup!
EOT;
$arr = explode("\n", $str);
$prevKey = '';
$split = ': ';
$output = array();
for ($i = 0, $arrlen = sizeof($arr); $i < $arrlen; $i++) {
    $keyValuePair = explode($split, $arr[$i], 2);
    // ?: Is this a valid key/value pair
    if (sizeof($keyValuePair) < 2 && $i > 0) {
        // -> Nope, append the value to the previous key's value
        $output[$prevKey] .= "\n" . $keyValuePair[0];
    }
    else {
        // -> Maybe
        // ?: Did we miss an escaped colon
        if (substr($keyValuePair[0], -1) === '\\') {
            // -> Yep, this means this is a value, not a key/value pair. Append both key and
            //    value (including the split between) to the previous key's value, ignoring
            //    any colons in the rest of the string (allowing dates to pass through)
            $output[$prevKey] .= "\n" . $keyValuePair[0] . $split . $keyValuePair[1];
        }
        else {
            // -> Nope, create a new key with a value
            $output[$keyValuePair[0]] = $keyValuePair[1];
            $prevKey = $keyValuePair[0];
        }
    }
}
var_dump($output);
Output:
array(5) {
["FooID"]=>
string(6) "123456"
["Name"]=>
string(5) "Chuck"
["When"]=>
string(19) "01/02/2013 01:23:45"
["InternalID"]=>
string(0) ""
["User Message"]=>
string(293) "Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
This\: works too. And can start with any number of \n's. It can be empty, too.
What's worse, though is that this CAN contain colons (but they're _"escaped"_
using `\`) like so `\:`, and even basic markup!"
}
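As the output shows, escaped colons stay escaped in the values. If you want plain colons in the end result, one possible follow-up step (not part of the answer above, just a sketch) is to unescape the values after parsing. The small $output array here is a hypothetical stand-in shaped like the real one.

```php
<?php
// Hypothetical parsed result, shaped like the $output array above
$output = array(
    'Name'         => 'Chuck',
    'User Message' => 'This\: works too',
);

// Turn each escaped "\:" back into a plain ":" (values only, keys untouched).
// array_map keeps the string keys because only one array is passed.
$clean = array_map(function ($value) {
    return str_replace('\\:', ':', $value);
}, $output);

echo $clean['User Message'], "\n"; // This: works too
```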
The following regex should work, but I'm not so sure anymore if it is the right tool for this:
preg_match_all(
    '%^                     # Start of line
    ([^:]*)                 # Match anything until a colon, capture in group 1
    :\s*                    # Match a colon plus optional whitespace
    (                       # Match and capture in group 2:
     (?:                    # Start of non-capturing group (used for alternation)
      .*$                   # Either match the rest of the line
      (?=                   # only if one of the following follows here:
       \Z                   # The end of the string
      |                     # or
       \r?\n                # a newline
       [^:\n\\\\]*          # followed by anything except colon, backslash or newline
       :                    # then a colon
      )                     # End of lookahead
     |                      # or match
      (?:                   # Start of non-capturing group (used for alternation/repetition)
       [^:\\\\]             # Either match a character except colon or backslash
      |                     # or
       \\\\.                # match any escaped character
      )*                    # Repeat as needed (end of inner non-capturing group)
     )                      # End of outer non-capturing group
    )                       # End of capturing group 2
    $                       # Match the end of the line%mx',
    $subject, $result, PREG_PATTERN_ORDER);
See it live on regex101.
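With PREG_PATTERN_ORDER, the captures come back grouped by capture group rather than by match, so the keys and values end up in two parallel lists. A minimal sketch of pairing them up into one associative array follows; the $result contents are hypothetical, hand-written to mimic that shape instead of being produced by the regex.

```php
<?php
// Hypothetical match data in PREG_PATTERN_ORDER shape:
// index 1 holds every group-1 capture (keys), index 2 every group-2 capture (values)
$result = array(
    1 => array('FooID', 'Name'),
    2 => array('123456', 'Chuck'),
);

// Pair the n-th key with the n-th value
$parsed = array_combine($result[1], $result[2]);

print_r($parsed);
```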
So here's what I came up with using a tricky preg_replace_callback():
$string ='FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n\'s. It can be empty, too
Yellow:cool';
$array = array();
// The second alternative is anchored with ^ so it cannot also produce an
// empty match at the end of a line that was already matched as key:value
preg_replace_callback('#^(.*?):(.*)|^.*$#m', function ($m) use (&$array) {
    static $last_key = ''; // Static, so the last key persists between callback calls
    if (isset($m[1])) { // If there is a normal match (key:value)
        $array[$m[1]] = $m[2]; // Then add to array
        $last_key = $m[1]; // Define the new last key
    } else { // Otherwise this is a continuation line
        $array[$last_key] .= PHP_EOL . $m[0]; // Add the whole line to the last entry
    }
}, $string); // Anonymous function used, thus PHP 5.3+ is required
print_r($array); // print
Downside: I'm using PHP_EOL to add newlines, which is OS-dependent.
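One way around the PHP_EOL downside, sketched here as a suggestion and not as part of the answer above, is to normalize the line endings of the input up front and then always join with "\n". The $raw sample is made up for illustration.

```php
<?php
// Made-up input using Windows-style line endings
$raw = "Name: Chuck\r\nUser Message: Hello,\r\nsecond line";

// Replace "\r\n" first, then any stray "\r", so every line ends in "\n"
$normalized = str_replace(array("\r\n", "\r"), "\n", $raw);

var_dump(strpos($normalized, "\r")); // bool(false)
```

After this step the callback can append "\n" . $m[0] in place of PHP_EOL . $m[0], and the result no longer depends on the platform the script runs on.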
I'm pretty new to PHP, so maybe this is totally out of whack, but maybe you could use something like:
$data = <<<EOT
FooID: 123456
Name: Chuck
When: 01/02/2013 01:23:45
InternalID: 789654
User Message: Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of \n's. It can be empty, too
EOT;
if ($key = preg_match_all('~^[^:\n]+?:~m', $data, $match)) {
    $val = explode('¬', preg_filter('~^[^:\n]+?:~m', '¬', $data));
    array_shift($val);
    $res = array_combine($match[0], $val);
}
print_r($res);
yields
Array
(
[FooID:] => 123456
[Name:] => Chuck
[When:] => 01/02/2013 01:23:45
[InternalID:] => 789654
[User Message:] => Hello,
this is nillable, but can be quite long. Text can be spread out over many lines
And can start with any number of
's. It can be empty, too
)
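Note that the keys keep their trailing colon, and the values appear to keep the whitespace around them (the space after the colon and the line break before the next key). If that matters, a small cleanup pass could look like this sketch; the $res array is a shortened, hypothetical stand-in for the real result.

```php
<?php
// Shortened, hypothetical stand-in for the $res array printed above
$res = array(
    'FooID:' => " 123456\n",
    'Name:'  => " Chuck\n",
);

$tidy = array();
foreach ($res as $key => $value) {
    // Drop the trailing colon from the key and surrounding whitespace from the value
    $tidy[rtrim($key, ':')] = trim($value);
}

print_r($tidy);
```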