PHP DOMDocument adds extra tags

伪装坚强ぢ 2020-12-19 03:09

I'm trying to parse a document, get all the image tags, and change each source to something different.


    $domDocument = new DOMDocument();

    $domDo         


        
5 Answers
  • 2020-12-19 03:51

    You can use SmartDOMDocument (http://beerpla.net/projects/smartdomdocument-a-smarter-php-domdocument-class/). From its description:

    DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).

    Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body>, and a DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

    SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.

  • 2020-12-19 03:58

    You just need to pass two flags to the loadHTML() method: LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD (available as of PHP 5.4 with libxml 2.7.8 or later). For example:

    $domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
    

    See IDEONE demo:

    $text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
    $domDocument = new DOMDocument();
    // The flags stop libxml from adding <html>/<body> wrappers and a default DOCTYPE.
    $domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Rewrite the src attribute of every <img> element.
    foreach ($domDocument->getElementsByTagName('img') as $image) {
        $image->setAttribute('src', 'lalala');
    }

    // Serialize once, after all changes have been made.
    $text = $domDocument->saveHTML();
    echo $text;
    

    Output:

    <p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>
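
    The demo above has a single root <p> element. With LIBXML_HTML_NOIMPLIED, a fragment that has several top-level elements may come back restructured, because the parser expects one root. A common workaround, sketched here under that assumption, is to wrap the fragment in a temporary container and serialize only the container's children. The function name replaceImageSources() and the <div> wrapper are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: replaces the src of every <img> in an HTML fragment.
function replaceImageSources($fragment, $newSrc)
{
    $doc = new DOMDocument();
    // Wrap the fragment so the parser always sees exactly one root element.
    $doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    foreach ($doc->getElementsByTagName('img') as $image) {
        $image->setAttribute('src', $newSrc);
    }

    // Serialize the children of the temporary <div>, not the <div> itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = replaceImageSources(
    '<p>One <img src="a.jpg"></p><p>Two <img src="b.jpg"></p>',
    'lalala'
);
echo $out, PHP_EOL;
```

    The wrapper never appears in the result, so the caller gets back a fragment with the same top-level structure it passed in.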
    
  • 2020-12-19 04:00

    DOMDocument unfortunately won't let you do this. Try stripping the added DOCTYPE and wrapper tags from the output instead:

    $html = str_replace(array('<html>', '</html>', '<body>', '</body>'), '', $domDocument->saveHTML());
    $text = trim(preg_replace('/^<!DOCTYPE.+?>/', '', $html));
    
  • 2020-12-19 04:00

    If you are up for a hack, this is how I managed to work around this annoyance: load the string as XML and save it as HTML. :)
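
    A minimal sketch of that hack, assuming the fragment is well-formed XML with a single root element (every tag closed, attributes quoted, no HTML-only entities such as &nbsp;). The sample string is borrowed from another answer in this thread, and the variable name $result is illustrative.

```php
<?php
$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';

$domDocument = new DOMDocument();

// loadXML() builds the tree from the markup as given, so no DOCTYPE,
// <html>, or <body> nodes are created. It returns false on malformed input.
if (!$domDocument->loadXML($text)) {
    throw new RuntimeException('Fragment is not well-formed XML');
}

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

// saveHTML() serializes with HTML rules; trim() drops the trailing newline.
$result = trim($domDocument->saveHTML());
echo $result, PHP_EOL;
```

    The trade-off is strictness: markup that a browser would tolerate, such as an unclosed <br>, makes loadXML() fail, so this suits input you control.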

  • 2020-12-19 04:02

    If you're going to save as HTML, you have to expect a valid HTML document to be created!

    There is another option: DOMDocument::saveXML() takes an optional node parameter, which lets you serialize the content of one particular element:

    $el = $domDocument->getElementsByTagName('p')->item(0);
    $text = $domDocument->saveXML($el);
    

    This presumes that your content only has one p element.
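
    If the content has more than one top-level element, the same idea can be extended by serializing every child of the generated <body>. The sketch below is one way to do that, not part of the original answer. It uses saveHTML() with a node argument (supported since PHP 5.3.6) rather than saveXML(), so that void elements such as <img> are written HTML-style. The sample text and the variable name $fragment are illustrative.

```php
<?php
$text = '<p>First <img src="http://example.com/beer.jpg" width="60" height="95" /> paragraph.</p>'
      . '<p>Second paragraph.</p>';

$domDocument = new DOMDocument();
// No flags here: the parser adds DOCTYPE, <html>, and <body> as usual.
$domDocument->loadHTML($text);

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

// Serialize only the nodes inside <body>, which skips the added wrappers.
$body = $domDocument->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $domDocument->saveHTML($child);
}
echo $fragment, PHP_EOL;
```

    Because only the children of <body> are written out, the DOCTYPE and wrapper tags never reach the output, whatever the number of top-level elements.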
