I\'m manipulating a short HTML snippet with XPath; when I output the changed snippet back with $doc->saveHTML(), DOCTYPE
gets added, and HTML / BODY
Here how I've done it:
-- Quick helper function that gives you HTML contents for specific DOM element
function nodeContent($n, $outer=false) { $d = new DOMDocument('1.0'); $b = $d->importNode($n->cloneNode(true),true); $d->appendChild($b); $h = $d->saveHTML(); // remove outter tags if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4)); return $h; }
-- Find body node in your doc and get its contents
$query = $xpath->query("//body")->item(0); if($query) { echo nodeContent($query); }
UPDATE 1:
Some extra info: Since PHP/5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter similarly to DOMDocument->saveXML(). You can do
$xpath = new DOMXPath($doc); $query = $xpath->query("//body")->item(0); echo $doc->saveHTML($query);
for others, the helper function will help
tl;dr
requires: PHP 5.4.0
and Libxml 2.6.0
$doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
explanation
http://php.net/manual/en/domdocument.loadhtml.php "Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."
LIBXML_HTML_NOIMPLIED
Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
LIBXML_HTML_NODEFDTD
Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
You have 2 ways to accomplish this:
$content = substr($content, strpos($content, '<html><body>') + 12); // Remove Everything Before & Including The Opening HTML & Body Tags.
$content = substr($content, 0, -14); // Remove Everything After & Including The Closing HTML & Body Tags.
Or even better is this way:
$dom->normalizeDocument();
$content = $dom->saveHTML();
saveHTML
can output a subset of document, meaning we can ask it to output every child node one by one, by traversing body.
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://google.com"><img src="http://google.com/img.jpeg" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
// Let's traverse the body and output every child node
$bodyNode = $doc->getElementsByTagName('body')->item(0);
foreach ($bodyNode->childNodes as $childNode) {
echo $doc->saveHTML($childNode);
}
This might not be a most elegant solution, but it works. Alternatively, we can wrap all children nodes inside some container element (say a div
) and output only that container (but container tag will be included in the output).
Here's a version that doesn't extend DOMDocument, though I think extending is the proper approach, since you're trying to achieve functionality that isn't built-in to the DOM API.
Note: I'm interpreting "clean" and "without workarounds" as keeping all manipulation to the DOM API. As soon as you hit string manipulation, that's workaround territory.
What I'm doing, just as in the original answer, is leveraging DOMDocumentFragment to manipulate multiple nodes all sitting at the root level. There is no string manipulation going on, which to me qualifies as not being a workaround.
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild($doc->doctype);
// Remove html element, preserving child nodes
$html = $doc->getElementsByTagName("html")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
$fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);
// Remove body element, preserving child nodes
$body = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($body->childNodes->length > 0) {
$fragment->appendChild($body->childNodes->item(0));
}
$body->parentNode->replaceChild($fragment, $body);
// Output results
echo htmlentities($doc->saveHTML());
This solution is rather lengthy, but it's because it goes about it by extending the DOM in order to keep your end code as short as possible.
sliceOutNode
is where the magic happens. Let me know if you have any questions:
<?php
class DOMDocumentExtended extends DOMDocument
{
public function __construct( $version = "1.0", $encoding = "UTF-8" )
{
parent::__construct( $version, $encoding );
$this->registerNodeClass( "DOMElement", "DOMElementExtended" );
}
// This method will need to be removed once PHP supports LIBXML_NOXMLDECL
public function saveXML( DOMNode $node = NULL, $options = 0 )
{
$xml = parent::saveXML( $node, $options );
if( $options & LIBXML_NOXMLDECL )
{
$xml = $this->stripXMLDeclaration( $xml );
}
return $xml;
}
public function stripXMLDeclaration( $xml )
{
return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
}
}
class DOMElementExtended extends DOMElement
{
public function sliceOutNode()
{
$nodeList = new DOMNodeListExtended( $this->childNodes );
$this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
}
public function replaceNodeWithNode( DOMNode $node )
{
return $this->parentNode->replaceChild( $node, $this );
}
}
class DOMNodeListExtended extends ArrayObject
{
public function __construct( $mixedNodeList )
{
parent::__construct( array() );
$this->setNodeList( $mixedNodeList );
}
private function setNodeList( $mixedNodeList )
{
if( $mixedNodeList instanceof DOMNodeList )
{
$this->exchangeArray( array() );
foreach( $mixedNodeList as $node )
{
$this->append( $node );
}
}
elseif( is_array( $mixedNodeList ) )
{
$this->exchangeArray( $mixedNodeList );
}
else
{
throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
}
}
public function toFragment( DOMDocument $contextDocument )
{
$fragment = $contextDocument->createDocumentFragment();
foreach( $this as $node )
{
$fragment->appendChild( $contextDocument->importNode( $node, true ) );
}
return $fragment;
}
// Built-in methods of the original DOMNodeList
public function item( $index )
{
return $this->offsetGet( $index );
}
public function __get( $name )
{
switch( $name )
{
case "length":
return $this->count();
break;
}
return false;
}
}
// Load HTML/XML using our fancy DOMDocumentExtended class
$doc = new DOMDocumentExtended();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild( $doc->doctype );
// Slice out html node
$html = $doc->getElementsByTagName("html")->item(0);
$html->sliceOutNode();
// Slice out body node
$body = $doc->getElementsByTagName("body")->item(0);
$body->sliceOutNode();
// Pick your poison: XML or HTML output
echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
echo htmlentities( $doc->saveHTML() );