I want to perform certain manipulations on a XML document with PHP using the DOM part of its standard library. As others have already discovered, one has to deal with decode
As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText
and DOMElement
differ in this regard.
To illustrate this, an example:
$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');
$s = 'text &<<"\'&text;&text';
$root = $doc->documentElement;
$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);
$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);
$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);
echo $doc->saveXML();
outputs
Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &<<"'&text;</tag1>
<tag2>text &amp;&lt;<"'&text;&text</tag2>
<tag3><![CDATA[text &<<"'&text;&text]]></tag3>
</root>
In this particular case, it is appropriate to alter the nodeValue of DOMText
nodes. Combining hakre's two answers one gets a quite elegant solution.
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
$visitTextNode = function (DOMText $node) {
$text = $node->textContent;
/*
do something with $text
*/
$node->nodeValue = $text;
};
foreach ($node_list as $node) {
if ($node->nodeType == XML_TEXT_NODE) {
$visitTextNode($node);
} else {
foreach ($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$visitTextNode($child);
}
}
}
}
Your question is basically whether or not setting DOMText::nodeValue to an XML encoded string or to a verbatim string.
So let's just try that out and set it to &
and '&
and see what happens:
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->childNodes->item(0);
echo "Before Edit: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&";
echo "After Edit 1: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&";
echo "After Edit 2: ", $doc->saveXML($text), "\n";
The output then is as the following (PHP 5.0.0 - 5.5.0):
Before Edit: *
After Edit 1: &
After Edit 2: &amp;
This shows that setting the nodeValue
of a DOMText
-node expects a UTF-8 encoded string and the DOM library encodes the XML reserved characters automatically.
So you should not apply htmlspecialchars()
onto any text you add this way. That would create a double-encoding.
As you write you experience the opposite I suggest you to execute an isolated PHP example on the commandline / within your IDE so that you can see exactly the output. Not that your browser renders this as HTML and then you think the reserved XML characters have not been encoded.
As you have pointed out you're not editing a DOMText
but an DOMElement
node. It works a bit different, here the &
character needs to be passed as entity &
instead of verbatim , however only this character.
So this needs a little bit more work:
DOMText
node. Everything will be perfectly encoded.DOMText
node form first step as child.And done. Here your inner foreach modified showing this:
foreach($node_list as $node) {
$text = $doc->createTextNode($node->textContent);
$node->nodeValue = "";
$node->appendChild($text);
}
For your concrete example albeit I must admit I don't understand why you do that because this does not change the value so it wouldn't need this.
Tip: In PHP DOMDocument can open this feed directly, you don't need curl here:
$doc = new DOMDocument(); $doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");