Parsing e-mail-like headers (similar to RFC822)

猫巷女王i 2021-01-20 14:51

Problem / Question

There is a database of bot information that I would like to parse. It is said to be similar to RFC822 messages.

Before I re-invent the

2 Answers
  • 2021-01-20 15:43

    The message/* MIME type family is pretty common. Plenty of parsers exist, but they tend to be hard to find with a search engine. Personally I resort to regex here, provided the format is reasonably consistent.

    For example these two will do the trick:

    // matches a consecutive block of RFC 822-style "Key: value" lines
    define("RX_RFC821_BLOCK", b"/(?:^\w[\w.-]*\w:.*\R(?:^[ \t].*\R)*)++\R*/m");
    
    // splits each "Key: value" entry, including its indented continuation lines
    define("RX_RFC821_SPLIT", b"/^(\w+(?:[-.]?\w+)*)\s*:\s*(.*\n(?:^[ \t].*\n)*)/m");
    

    The first pattern extracts coherent blocks of message/* lines, and the second splits each such block into key and value. The values still need post-processing to strip the leading indentation from continuation lines.
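
    As a rough sketch of how the two patterns and the post-processing step could fit together: the function name parse_header_blocks() and the sample records are made up for illustration, and the input is assumed to use LF line endings with a newline after the last line.

```php
<?php

// Patterns copied from the answer (the optional b"" string prefix is dropped).
define("RX_RFC821_BLOCK", "/(?:^\w[\w.-]*\w:.*\R(?:^[ \t].*\R)*)++\R*/m");
define("RX_RFC821_SPLIT", "/^(\w+(?:[-.]?\w+)*)\s*:\s*(.*\n(?:^[ \t].*\n)*)/m");

// Hypothetical helper: returns one associative array per header block.
function parse_header_blocks(string $data): array
{
    $records = [];

    // Step 1: pull out each run of consecutive "Key: value" lines.
    preg_match_all(RX_RFC821_BLOCK, $data, $blocks);

    foreach ($blocks[0] as $block) {
        // Step 2: split the block into key/value pairs.
        preg_match_all(RX_RFC821_SPLIT, $block, $pairs, PREG_SET_ORDER);

        $record = [];
        foreach ($pairs as $pair) {
            // Step 3: fold continuation lines (newline plus indentation)
            // into a single space.
            $record[$pair[1]] = preg_replace('/\R[ \t]+/', ' ', trim($pair[2]));
        }
        $records[] = $record;
    }

    return $records;
}

// Made-up sample data; the last line must end with a newline to be matched.
$sample = "robot-id: examplebot\n"
        . "robot-name: Example Bot\n"
        . "robot-description: Crawls pages\n"
        . "  and indexes them.\n"
        . "\n"
        . "robot-id: otherbot\n"
        . "robot-name: Other Bot\n";

$records = parse_header_blocks($sample);
print_r($records);
```

    Note that a header with an empty value would let the \s* after the colon run into the next line, so this sketch assumes every key has a value on its own line.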

  • 2021-01-20 15:47

    Assuming that $data contains the sample data you pasted above, here is the parser:

    <?php
    
    /* 
     * $data = <<<'DATA'
     * <put-sample-data-here>
     * DATA;
     *
     */
    
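    $parsed  = array();
    // Records are separated by blank lines.
    $blocks  = preg_split('/\n\n/', $data);
    $lines   = array();
    $matches = array();
    foreach ($blocks as $i => $block) {
        $parsed[$i] = array();
        // Split the block at each "Key: value" header; PREG_SPLIT_DELIM_CAPTURE
        // keeps the captured header text among the returned pieces.
        $lines = preg_split('/\n(([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?)/',
                            $block, -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($lines as $line) {
            // Keep only the pieces that form a complete "Key: value" header.
            if (preg_match('/^\n?([\w.-]+)\: *((.*\n\s+.+)+|(.*(?:\n))|(.*))?$/',
                           $line, $matches)) {
                // Fold continuation lines (newline plus spaces) into one space.
                $parsed[$i][$matches[1]] = preg_replace('/\n +/', ' ',
                                                        trim($matches[2]));
            }
        }
    }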
    
    print_r($parsed);
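
    For comparison, the same blank-line-separated "Key: value" format can also be read one line at a time, which avoids the nested split patterns. This is a different technique from the parser above, shown only as a minimal sketch; the function name parse_records_by_line() and the sample records are hypothetical.

```php
<?php

// Hypothetical helper: parses blank-line-separated records of "Key: value"
// headers, folding indented continuation lines into the previous value.
function parse_records_by_line(string $data): array
{
    $records = [];
    $current = [];
    $lastKey = null;

    foreach (preg_split('/\R/', $data) as $line) {
        if (trim($line) === '') {
            // Blank line: close the current record, if it has any headers.
            if ($current !== []) {
                $records[] = $current;
            }
            $current = [];
            $lastKey = null;
        } elseif ($lastKey !== null && preg_match('/^[ \t]+(.*)$/', $line, $m)) {
            // Indented line: continuation of the previous header's value.
            $current[$lastKey] = ltrim($current[$lastKey] . ' ' . trim($m[1]));
        } elseif (preg_match('/^([\w.-]+):[ \t]*(.*)$/', $line, $m)) {
            // New header line.
            $lastKey = $m[1];
            $current[$lastKey] = rtrim($m[2]);
        }
        // Any other line is skipped in this sketch.
    }

    // Close the last record if the input did not end with a blank line.
    if ($current !== []) {
        $records[] = $current;
    }

    return $records;
}

// Made-up sample data for illustration.
$sample = "robot-id: examplebot\n"
        . "robot-name: Example Bot\n"
        . "robot-description: Crawls pages\n"
        . "  and indexes them.\n"
        . "\n"
        . "robot-id: otherbot\n"
        . "robot-name: Other Bot\n";

$byLine = parse_records_by_line($sample);
print_r($byLine);
```

    A line-based loop like this is usually easier to extend (for example, to report malformed lines) than a single large pattern, at the cost of a few more lines of code.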
    