I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities.
As hakre explained, the problem is that in PHP's DOM library the behaviour of setting nodeValue with respect to entities depends on the class of the node; in particular, DOMText and DOMElement differ in this regard.
To illustrate this, an example:
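    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadXML('<root/>');
    $s = 'text &amp;&lt;<"\'&text;&text';
    $root = $doc->documentElement;

    // createElement() with a value: the value is parsed for entity references
    $node = $doc->createElement('tag1', $s); // line 10 in the warning below
    $root->appendChild($node);

    // text node: the string is taken literally and escaped on output
    $node = $doc->createElement('tag2');
    $text = $doc->createTextNode($s);
    $node->appendChild($text);
    $root->appendChild($node);

    // CDATA section: the string is emitted verbatim
    $node = $doc->createElement('tag3');
    $text = $doc->createCDATASection($s);
    $node->appendChild($text);
    $root->appendChild($node);

    echo $doc->saveXML();

outputs

    Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
    <?xml version="1.0"?>
    <root>
      <tag1>text &amp;&lt;&lt;"'&text;</tag1>
      <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
      <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
    </root>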
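The same difference can be probed by assigning to nodeValue directly. Here is a minimal sketch, assuming a made-up two-element document and a sample string of my own (neither comes from the example above). On a DOMText node the string is stored literally. On a DOMElement the entity handling may depend on the PHP and libxml2 versions in use, so the sketch only prints that result instead of claiming what it is.

```php
<?php
// Hypothetical document and sample string, chosen only for illustration.
$doc = new DOMDocument();
$doc->loadXML('<root><a>x</a><b>x</b></root>');
$s = 'R&amp;D';

$a = $doc->getElementsByTagName('a')->item(0);
$b = $doc->getElementsByTagName('b')->item(0);

// DOMElement: whether "&amp;" is treated as an entity reference here is
// version-dependent, so the result is printed rather than asserted.
$a->nodeValue = $s;

// DOMText: the string is stored as-is; "&" is escaped again on output.
$b->firstChild->nodeValue = $s;

$elementXml = $doc->saveXML($a);
$textXml    = $doc->saveXML($b);   // <b>R&amp;amp;D</b>
$textBack   = $b->textContent;     // identical to $s

echo $elementXml, "\n", $textXml, "\n";
```

If the text is meant to round-trip unchanged, going through the DOMText node is the safer of the two routes, since reading textContent afterwards gives back exactly what was assigned.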
In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers, one gets quite an elegant solution:
    $doc = new DOMDocument();
    $doc->loadXML($xml);                      // $xml: the XML source (placeholder)
    $xpath = new DOMXPath($doc);
    $node_list = $xpath->query($expression);  // $expression: the XPath query (placeholder)

    $visitTextNode = function (DOMText $node) {
        $text = $node->textContent;
        /*
            do something with $text
        */
        $node->nodeValue = $text;
    };

    foreach ($node_list as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            // the query matched a text node directly
            $visitTextNode($node);
        } else {
            // otherwise visit the node's immediate text children
            foreach ($node->childNodes as $child) {
                if ($child->nodeType == XML_TEXT_NODE) {
                    $visitTextNode($child);
                }
            }
        }
    }
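To make the pattern concrete, here is a small runnable sketch. The input document, the XPath expression `//p/text()` and the use of `strtoupper()` as the "do something" step are all assumptions of mine, picked only for illustration. Selecting the text nodes directly in the query also removes the need for the inner loop over childNodes.

```php
<?php
// Hypothetical input; "&amp;" and "&lt;" are decoded into plain text on load.
$doc = new DOMDocument();
$doc->loadXML('<doc><p>fish &amp; chips</p><p>a &lt; b</p></doc>');
$xpath = new DOMXPath($doc);

// Query the text nodes themselves, so every result is a DOMText.
foreach ($xpath->query('//p/text()') as $node) {
    // textContent holds the decoded text, e.g. "fish & chips".
    $text = strtoupper($node->textContent);

    // On a DOMText node the assigned string is stored literally;
    // "&" and "<" are escaped again when the document is serialized.
    $node->nodeValue = $text;
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// prints <doc><p>FISH &amp; CHIPS</p><p>A &lt; B</p></doc>
```

Because the manipulation happens on the decoded text and the escaping is left to the serializer, no manual calls to htmlspecialchars() or html_entity_decode() are needed in this sketch.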