DOM in PHP: Decoded entities and setting nodeValue

刺人心 2021-01-19 08:21

I want to perform certain manipulations on an XML document with PHP using the DOM part of its standard library. As others have already discovered, one has to deal with decode

2 Answers
  • 2021-01-19 08:31

    As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard. To illustrate this, an example:

    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadXML('<root/>');
    
    $s = 'text &amp;&lt;<"\'&text;&text';
    
    $root = $doc->documentElement;
    
    $node = $doc->createElement('tag1', $s); #line 10
    $root->appendChild($node);
    
    $node = $doc->createElement('tag2');
    $text = $doc->createTextNode($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    $node = $doc->createElement('tag3');
    $text = $doc->createCDATASection($s);
    $node->appendChild($text);
    $root->appendChild($node);
    
    echo $doc->saveXML();
    

    outputs

    Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
    <?xml version="1.0"?>
    <root>
      <tag1>text &amp;&lt;&lt;"'&text;</tag1>
      <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
      <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
    </root>
    

    In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.

    $doc = new DOMDocument();
    $doc->loadXML(<XML data>);
    
    $xpath     = new DOMXPath($doc);
    $node_list = $xpath->query(<some XPath>);
    
    $visitTextNode = function (DOMText $node) {
        $text = $node->textContent;
        /*
            do something with $text
        */
        $node->nodeValue = $text;
    };
    
    foreach ($node_list as $node) {
        if ($node->nodeType == XML_TEXT_NODE) {
            $visitTextNode($node);
        } else {
            foreach ($node->childNodes as $child) {
                if ($child->nodeType == XML_TEXT_NODE) {
                    $visitTextNode($child);
                }
            }
        }
    }
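
For a concrete run of the loop above, here is a minimal sketch. The XML snippet, the XPath `//item`, and the uppercase transformation are assumptions chosen purely for illustration; they are not part of the original question.

```php
<?php
// Hypothetical input: one element with an escaped ampersand, one with mixed content.
$doc = new DOMDocument();
$doc->loadXML('<root><item>fish &amp; chips</item><item>a<b/>c</item></root>');

$xpath     = new DOMXPath($doc);
$node_list = $xpath->query('//item'); // illustrative XPath

// Text nodes take verbatim strings; the serializer escapes them on output.
$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;           // decoded, e.g. "fish & chips"
    $node->nodeValue = strtoupper($text); // written back without manual escaping
};

foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// <root><item>FISH &amp; CHIPS</item><item>A<b/>C</item></root>
```

Only the direct text children are touched, so the nested `<b/>` element stays in place and the ampersand is escaped exactly once.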
    
  • 2021-01-19 08:45

    Your question is basically whether DOMText::nodeValue should be set to an XML-encoded string or to a verbatim string.

    So let's just try that out: set it to & and then to &amp; and see what happens:

    $doc = new DOMDocument();
    $doc->loadXML('<root>*</root>');
    
    $text = $doc->documentElement->childNodes->item(0);
    
    echo "Before Edit: ", $doc->saveXML($text), "\n";
    
    $text->nodeValue = "&";
    
    echo "After Edit 1: ", $doc->saveXML($text), "\n";
    
    $text->nodeValue = "&amp;";
    
    echo "After Edit 2: ", $doc->saveXML($text), "\n";
    

    The output is then as follows (PHP 5.0.0 - 5.5.0):

    Before Edit: *
    After Edit 1: &amp;
    After Edit 2: &amp;amp;
    

    This shows that the nodeValue of a DOMText node expects a verbatim UTF-8 encoded string, and that the DOM library escapes the XML reserved characters automatically when the document is serialized.

    So you should not apply htmlspecialchars() to any text you add this way. That would result in double encoding.
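
The double encoding can be reproduced with a small sketch. The sample string `Tom & Jerry` is a hypothetical value picked only for this illustration.

```php
<?php
// Hypothetical sample string, chosen only for illustration.
$raw = 'Tom & Jerry';

$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->firstChild; // the DOMText node holding "*"

// Verbatim string: the serializer escapes "&" exactly once.
$text->nodeValue = $raw;
$once = $doc->saveXML($text);

// Pre-escaped string: the "&" inside "&amp;" is escaped a second time.
$text->nodeValue = htmlspecialchars($raw);
$twice = $doc->saveXML($text);

echo $once, "\n";  // Tom &amp; Jerry
echo $twice, "\n"; // Tom &amp;amp; Jerry
```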

    Since you write that you experience the opposite, I suggest you run an isolated PHP example on the command line or within your IDE so that you can see the exact output. Otherwise your browser may render the output as HTML, which can make it look as if the reserved XML characters had not been encoded.


    As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of verbatim, but only this character.
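
A side-by-side sketch of that difference, assuming the classic `DOMDocument` API and a hypothetical two-element document (the element names `el` and `txt` are made up for this example). The same string is assigned to an element and to a text node:

```php
<?php
// Hypothetical document with one element to edit directly and one text node.
$doc = new DOMDocument();
$doc->loadXML('<root><el>*</el><txt>*</txt></root>');

$el  = $doc->getElementsByTagName('el')->item(0);              // DOMElement
$txt = $doc->getElementsByTagName('txt')->item(0)->firstChild; // DOMText

$value = 'a &amp; b';

$el->nodeValue  = $value; // element: "&amp;" is read as an entity reference
$txt->nodeValue = $value; // text node: the string is taken verbatim

$elText  = $el->textContent;
$txtText = $txt->textContent;
$xml     = $doc->saveXML($doc->documentElement);

echo $elText, "\n";  // a & b
echo $txtText, "\n"; // a &amp; b
echo $xml, "\n";     // <root><el>a &amp; b</el><txt>a &amp;amp; b</txt></root>
```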

    So this needs a little bit more work:

    1. Read out the text content and turn it into a DOMText node. Everything will be encoded correctly.
    2. Clear the node value of the element node so that it is empty.
    3. Append the DOMText node from the first step as a child.

    And done. Here is your inner foreach, modified to show this:

    foreach($node_list as $node) {
        $text = $doc->createTextNode($node->textContent);
        $node->nodeValue = "";
        $node->appendChild($text);
    }
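
The three steps can be tried end to end with the following sketch. The document, the XPath `//title`, and the appended `' & more'` suffix are hypothetical stand-ins for the feed data and the manipulation from the question.

```php
<?php
// Hypothetical document; in the question the nodes came from an RSS feed.
$doc = new DOMDocument();
$doc->loadXML('<root><title>Fish &amp; Chips &lt; 5</title></root>');

$xpath     = new DOMXPath($doc);
$node_list = $xpath->query('//title'); // illustrative XPath

foreach ($node_list as $node) {
    $value = $node->textContent . ' & more'; // edit the decoded, verbatim string
    $text  = $doc->createTextNode($value);   // step 1: wrap it in a DOMText node
    $node->nodeValue = "";                   // step 2: empty the element
    $node->appendChild($text);               // step 3: attach the text node
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// <root><title>Fish &amp; Chips &lt; 5 &amp; more</title></root>
```

Because the new string goes through createTextNode(), the bare & in the suffix needs no manual escaping and raises no entity warning.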
    

    I must admit, though, that for your concrete example I don't understand why you do that: it does not change the value, so it wouldn't be needed.

    Tip: In PHP, DOMDocument can open this feed directly; you don't need curl here:

    $doc = new DOMDocument();
    $doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
    