parse invalid XML manually

孤城傲影 2020-12-20 09:41


1 answer
  • 2020-12-20 10:10

    The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes these errors.

    That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), but this time I try to correct the errors with custom rules (these aren't universal fixes; they are adapted to the specific situation).
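
    To make the difference between the two parsers concrete, here is a minimal sketch. The $markup string is a made-up fragment (not the asker's file) containing a bare '&' and an attribute without a value; the exact markup that loadHTML produces depends on the libxml version, so no output is shown.

```php
<?php
// Hypothetical fragment: a bare '&' and an attribute without a value.
$markup = '<product name="a & b" featured></product>';

libxml_use_internal_errors(true); // keep parser messages out of the output

$strict = new DOMDocument;
$strictOk = $strict->loadXML($markup);    // XML parser: refuses the malformed input

$lenient = new DOMDocument;
$lenientOk = $lenient->loadHTML($markup); // HTML parser: repairs what it can

libxml_clear_errors();

var_dump($strictOk, $lenientOk);
echo $lenient->saveHTML(); // how the markup was repaired is decided by libxml, not by you
```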

    When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances that you retrieve with libxml_get_errors(). Each of them contains an error code, the error line and the error column. (Note that line and column numbering starts at 1.)
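
    As a quick illustration of what these error objects look like, the sketch below parses a small made-up string (an assumption for the example, not the asker's data) and collects the code, line and column of each error. The exact list of errors can vary with the libxml version.

```php
<?php
// Hypothetical malformed input: a bare '&' inside an attribute on line 2.
$xml = "<root>\n  <item name=\"a & b\"/>\n</root>";

$previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$dom = new DOMDocument;
$ok = $dom->loadXML($xml);

$report = [];
foreach (libxml_get_errors() as $error) {
    // Each $error is a LibXMLError with public code, line, column and message properties.
    $report[] = sprintf('code=%d line=%d column=%d %s',
                        $error->code, $error->line, $error->column, trim($error->message));
}

libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous error-handling mode

print_r($report);
```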

    $xml = file_get_contents('file.xml');
    
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
    
    if ($errors) {
        // LIBXML constant name, LIBXML error code // LIBXML error message
        define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
        define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
        define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
    
        $rules = [
            XML_ERR_LT_IN_ATTRIBUTE => [
                'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
                'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
            ],
            XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
                'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
                'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
            ],
            XML_ERR_NAME_REQUIRED => [
                'pattern' => '~^.{%d}[^&]*\K&~',
                'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
            ]
        ];
    
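        $previousLineNo = 0;
        $offset = -1;
        $lines = explode("\n", $xml);
    
        foreach ($errors as $error) {
            // Skip errors that have no repair rule.
            if (!isset($rules[$error->code])) {
                continue;
            }
    
            $currentLineNo = $error->line;
    
            // First error on a new line: reset the column offset.
            if ($currentLineNo != $previousLineNo) {
                $offset = -1;
            }
    
            // Work on the line by reference so the fix is written back into $lines.
            $currentLine = &$lines[$currentLineNo - 1];
            $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
            $currentLine = preg_replace($pattern,
                                        $rules[$error->code]['replacement']['string'],
                                        $currentLine, -1, $count);
            // The line grew, so shift the columns of later errors on the same line.
            $offset += $rules[$error->code]['replacement']['size'] * $count;
            $previousLineNo = $currentLineNo;
        }
        unset($currentLine); // break the reference into $lines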
    
        $xml = implode("\n", $lines);
    
        libxml_clear_errors();
        $dom->loadXML($xml);
        $errors = libxml_get_errors();
    }
    
    var_dump($errors);
    
    $s = simplexml_import_dom($dom);
    
    echo $s->product[0]["name"];
    

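    The size in the rules array is the difference between the length of the replacement string and the length of the replaced string: for example, '&lt;' (4 characters) replaces '<' (1 character), so its size is 3. This way, when there are several errors on the same line, $offset shifts the position of each subsequent error by the amount the line has already grown.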
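
    The offset bookkeeping can be tried in isolation. The sketch below reuses the '&' pattern from the rules array on a made-up line; the $positions array is an assumption that stands in for the columns libxml would report, given here as 0-based positions in the original line.

```php
<?php
// Hypothetical line with two bare '&' characters.
$line = '<p title="a & b" id="c & d"/>';
$positions = [12, 23];           // 0-based position of each '&' in the ORIGINAL line
$pattern = '~^.{%d}[^&]*\K&~';   // same shape as the XML_ERR_NAME_REQUIRED rule
$size = strlen('&amp;') - strlen('&'); // 4: how much each replacement lengthens the line

$offset = 0;
foreach ($positions as $position) {
    // Skip the characters before the error, then replace the next '&'.
    $line = preg_replace(sprintf($pattern, $position + $offset), '&amp;', $line, -1, $count);
    $offset += $size * $count;   // later positions move right by the growth so far
}

echo $line, "\n"; // <p title="a &amp; b" id="c &amp; d"/>
```

    Without the offset, the second pattern would skip only 23 characters and still find the right '&' here, but in a line where another '&' (for instance the one in a freshly inserted '&amp;') sits between the stale and the real position, the wrong character would be replaced.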

    The libxml error constants are not available in PHP, which is why they are defined manually here (only to make the code more readable). They are listed in the xmlParserErrors enumeration of libxml2's xmlerror.h header.
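
    If calling define() inside the if block feels awkward, one alternative is to group the codes as class constants. This is only a sketch: the class name LibxmlParserError and its nameOf() helper are hypothetical (PHP ships nothing like them), and the values are the three codes used above.

```php
<?php
// Hypothetical holder class for the libxml2 parser error codes used in this answer.
final class LibxmlParserError
{
    const LT_IN_ATTRIBUTE = 38;         // Unescaped '<' not allowed in attributes values
    const ATTRIBUTE_WITHOUT_VALUE = 41; // Specification mandate value for attribute
    const NAME_REQUIRED = 68;           // xmlParseEntityRef: no name

    // Returns the constant name for a code, or null when the code is not listed here.
    public static function nameOf(int $code): ?string
    {
        $names = array_flip((new ReflectionClass(self::class))->getConstants());
        return $names[$code] ?? null;
    }
}

echo LibxmlParserError::nameOf(68), "\n"; // NAME_REQUIRED
```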
