There is a database of bot information that I would like to parse. It is said to be similar to RFC822 messages.
Before I re-invent the
The message/* MIME type is pretty common. Plenty of parsers exist, but they are often hard to find via Google. Personally I resort to regex here, if the format is reasonably consistent.
For example these two will do the trick:
// matches a block of consecutive RFC 822-style key: value lines
define("RX_RFC821_BLOCK", b"/(?:^\w[\w.-]*\w:.*\R(?:^[ \t].*\R)*)++\R*/m");
// break up Key: value lines
define("RX_RFC821_SPLIT", b"/^(\w+(?:[-.]?\w+)*)\s*:\s*(.*\n(?:^[ \t].*\n)*)/m");
The first pattern breaks out coherent blocks of message/* lines, and the second can be used to split each such block into key/value pairs. The values need post-processing to strip the leading indentation from continued lines, though.
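Here is a rough sketch of how the two patterns and the post-processing step might fit together. The helper name `parse_blocks`, the constant names `RX_BLOCK` / `RX_SPLIT`, and the sample input are all made up for illustration. Joining continuation lines with a single space is just one possible choice, not something the format mandates.

```php
<?php
// The same two patterns as above, written as plain single-quoted strings.
const RX_BLOCK = '/(?:^\w[\w.-]*\w:.*\R(?:^[ \t].*\R)*)++\R*/m';
const RX_SPLIT = '/^(\w+(?:[-.]?\w+)*)\s*:\s*(.*\n(?:^[ \t].*\n)*)/m';

// Hypothetical helper (not part of any library): returns one
// associative array per block of key: value lines.
function parse_blocks(string $data): array
{
    $records = [];
    // Step 1: pull out each coherent block of header-style lines.
    preg_match_all(RX_BLOCK, $data, $blocks);
    foreach ($blocks[0] as $block) {
        $record = [];
        // Step 2: split the block into key/value pairs.
        preg_match_all(RX_SPLIT, $block, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            // Step 3: trim the value, then turn every "line break plus
            // indentation" into one space (assumed unfolding rule).
            $record[$pair[1]] = preg_replace('/\R[ \t]+/', ' ', trim($pair[2]));
        }
        $records[] = $record;
    }
    return $records;
}

// Made-up sample input, not the real bot database.
$data = "robot-id: examplebot\n"
      . "robot-name: Example Bot\n"
      . "robot-details: first line\n"
      . "    continued line\n"
      . "\n"
      . "robot-id: otherbot\n"
      . "robot-name: Other Bot\n";

print_r(parse_blocks($data));
```

A few caveats with this sketch: the block pattern only recognizes keys of at least two characters, the last line of the input has to end with a line break, and a field with an empty value may swallow the following line because of the `\s*` after the colon.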
Assuming that $data contains the sample data you pasted above, here is the parser:
<?php
/*
* $data = <<<'DATA'
* <put-sample-data-here>
* DATA;
*
*/
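$parsed = array();
// Records are separated by blank lines.
$blocks = preg_split('/\n\n/', $data);
$lines = array();
$matches = array();
foreach ($blocks as $i => $block) {
    $parsed[$i] = array();
    // Split the block at each "Key: value" field; the captured groups are returned as pieces too.
    $lines = preg_split('/\n(([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?)/',
        $block, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($lines as $line) {
        // Keep only the pieces that look like a complete "Key: value" field.
        if (preg_match('/^\n?([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?$/',
            $line, $matches)) {
            // Join continuation lines into one space-separated value.
            $parsed[$i][$matches[1]] = preg_replace('/\n +/', ' ',
                trim($matches[2]));
        }
    }
}
print_r($parsed);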