The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes these errors.
That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), but this time I will try to correct the errors with custom rules. These rules aren't universal fixes; they are adapted to the specific situation.
When you set libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances. Each of them contains an error code, the error line and the error column. (Note that the first line and the first column are numbered 1.)
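As a minimal sketch of what those error objects look like, the snippet below parses a deliberately malformed sample string (an assumption for illustration, not the file from the question) and prints the code, line and column of each collected error. The exact codes and columns depend on the libxml version in use.

```php
<?php
// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
// Sample input (hypothetical): a bare "&" inside an attribute value.
$dom->loadXML('<root><item name="a & b"/></root>');

$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each $error is a LibXMLError with public code, line, column and message.
    printf(
        "code=%d line=%d column=%d message=%s\n",
        $error->code,
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Empty the internal error buffer before the next parse.
libxml_clear_errors();
```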
$xml = file_get_contents('file.xml');

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);

$errors = libxml_get_errors();

if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

    // One repair rule per error code. %d is replaced by the number of
    // characters to skip before the error position on the line.
    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];

    $previousLineNo = 0;
    $lines = explode("\n", $xml);

    foreach ($errors as $error) {
        // Skip errors that have no repair rule.
        if (!isset($rules[$error->code])) continue;

        $currentLineNo = $error->line;

        // Reset the offset each time the errors move on to a new line.
        if ($currentLineNo != $previousLineNo) {
            $offset = -1;
        }

        $currentLine = &$lines[$currentLineNo - 1];
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
                                    $rules[$error->code]['replacement']['string'],
                                    $currentLine, -1, $count);

        // Shift later error positions on this line by the characters added.
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }
    unset($currentLine); // break the reference into $lines

    $xml = implode("\n", $lines);

    // Parse the repaired source again and collect any remaining errors.
    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}

var_dump($errors);

$s = simplexml_import_dom($dom);
echo $s->product[0]["name"];
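To see how a single rule is applied, here is a small sketch that runs the XML_ERR_NAME_REQUIRED pattern on one line in isolation. The sample line and the skip count of 10 are hypothetical values chosen for illustration; in the code above the skip count comes from the error column plus $offset.

```php
<?php
// Pattern template of the XML_ERR_NAME_REQUIRED rule; %d is the number of
// characters to skip before searching for the bare "&".
$template = '~^.{%d}[^&]*\K&~';

// Hypothetical sample values, not taken from a real libxml error.
$line = '<p title="a & b"/>';
$skip = 10;

// Build the concrete pattern, then escape the first "&" found after the
// skipped prefix. \K keeps everything before the "&" out of the match.
$pattern = sprintf($template, $skip);
$fixed   = preg_replace($pattern, '&amp;', $line, -1, $count);

echo $fixed, "\n";
```

Because the pattern is anchored with ^, it can match at most once per call, so each reported error repairs exactly one character sequence.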
The size in the rules array is the difference between the size of the replacement string and the size of the replaced string. This way, when there are several errors on the same line, the position of the next error is updated with $offset.
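The arithmetic behind those sizes can be checked with a short sketch. The helper sizeDelta() is a hypothetical name introduced only for this illustration, and the sample line is made up.

```php
<?php
// Hypothetical helper: how many characters a replacement adds to the line.
function sizeDelta(string $replaced, string $replacement): int
{
    return strlen($replacement) - strlen($replaced);
}

$ltDelta   = sizeDelta('<', '&lt;');           // one "<" becomes "&lt;"
$ampDelta  = sizeDelta('&', '&amp;');          // one "&" becomes "&amp;"
$quotDelta = sizeDelta('""', '&quot;&quot;');  // two quotes become two "&quot;"

// Sample line with two bare ampersands.
$line      = 'a & b & c';
$secondAmp = strrpos($line, '&');              // position before any fix

// Escape only the first "&"; everything after it shifts right by $ampDelta.
$fixed      = preg_replace('~&~', '&amp;', $line, 1);
$shiftedAmp = strrpos($fixed, '&');            // position after the fix
```

After the first fix, the second ampersand sits exactly $ampDelta characters further right, which is the shift that $offset accumulates for later errors on the same line.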
The libxml parser error constants are not available in PHP, which is why they are manually defined here (only to make the code more readable). You can find them in the xmlParserErrors enum of libxml2's xmlerror.h header.
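If you also want readable names when logging errors that have no rule yet, one option is a lookup table keyed by error code. This is only a sketch: KNOWN_PARSER_ERRORS and describeErrorCode() are hypothetical names, and the table holds just the three codes used in this answer.

```php
<?php
// Codes copied from libxml2's xmlParserErrors enum; PHP does not define them.
const KNOWN_PARSER_ERRORS = [
    38 => 'XML_ERR_LT_IN_ATTRIBUTE',
    41 => 'XML_ERR_ATTRIBUTE_WITHOUT_VALUE',
    68 => 'XML_ERR_NAME_REQUIRED',
];

// Hypothetical helper: map a numeric code to its libxml name when known.
function describeErrorCode(int $code): string
{
    return KNOWN_PARSER_ERRORS[$code] ?? "unknown libxml error $code";
}

echo describeErrorCode(68), "\n";
```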