What I am trying to do is include an HTML file within a PHP system (not a problem) but that HTML file also needs to be usable on its own, for various reasons, so I need to know
You may want to use the PHP tidy extension, which can repair invalid XHTML structures (the kind that make DOMDocument fail on load) and can also extract only the body:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');
Then load extracted body into DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Then traverse, extract, or move the XML nodes around as needed, and save:
$output = $xml->saveXML();
A solution with only one instance of DOMDocument and without loops (note that the output includes the <body> tags themselves):
$d = new DOMDocument();
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
echo $d->saveHTML($body);
Use a DOM parser. This is not tested, but it ought to do what you want:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('/path/to/file');
$body = $domDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    echo $child->C14N(); // Note: this canonicalizes the representation of the node, but that's not necessarily a bad thing
}
If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)
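This may be a solution on the client side (JavaScript). I tried it and it works fine.
function parseHTML(string) {
    var parser = new DOMParser();
    var result = parser.parseFromString(string, "text/html");
    // result.body avoids the firstChild.lastChild chain, which fails when the input starts with a doctype
    return result.body.firstChild;
}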
$site = file_get_contents("http://www.google.com/");
preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches);
echo($matches[1]);
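One caveat with the regex approach: preg_match returns 0 when the pattern does not match, and $matches[1] is then undefined. A minimal sketch of a guarded version follows; the function name extractBodyWithRegex and the sample strings are hypothetical, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical wrapper around the pattern used above.
// Returns the text between <body ...> and the first </body>, or null if there is no match.
function extractBodyWithRegex(string $html): ?string
{
    // "i" makes the match case-insensitive, "s" lets "." span newlines.
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Sample input (made up for illustration).
$sample = "<html><body class=\"x\">\n<p>Hi</p>\n</body></html>";
var_dump(extractBodyWithRegex($sample));              // string containing <p>Hi</p>
var_dump(extractBodyWithRegex('<p>no body tag</p>')); // NULL
```

Because the capture is non-greedy, it stops at the first closing body tag, so markup that contains the literal text "</body>" inside a script or comment would be cut short; a DOM parser does not have that problem.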
Use DOMDocument to keep what you need rather than strip what you don't need (requires PHP >= 5.3.6):
$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerHTML on $body by enumerating child nodes
// and saving them individually
foreach ($body->childNodes as $childNode) {
    echo $d->saveHTML($childNode);
}
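If the body needs to be embedded in a larger PHP page rather than echoed directly, the same loop can be wrapped in a function that returns a string. This is only a sketch: the name bodyInnerHtml is hypothetical, the scalar type hints assume PHP 7 or later, and suppressing libxml warnings is an assumption about how you want sloppy markup handled.

```php
<?php
// Hypothetical helper: returns the markup inside <body> as a string,
// without the <body> tags themselves.
function bodyInnerHtml(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // no body element was produced by the parser
    }

    $inner = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() requires PHP >= 5.3.6.
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Sample input (made up for illustration).
echo bodyInnerHtml('<html><head><title>t</title></head><body><p>Hello</p></body></html>');
```

The returned string can then be printed inside the including template, while the original HTML file stays usable on its own.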