I'm using the DOM extension in PHP to build some HTML documents, and I want the output to be formatted nicely (with new lines and indentation) so that it's readable, however...
You're right, there seems to be no indentation for HTML output (others are also confused). XML output works, even with loaded code.
<?php
function tidyHTML($buffer) {
    // load our document into a DOM object
    $dom = new DOMDocument();
    // we want nice output
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($buffer);
    $dom->formatOutput = true;
    return $dom->saveHTML();
}
// start output buffering, using our nice
// callback function to format the output.
ob_start("tidyHTML");
?>
<html>
<head>
<title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
<?php
// this will be called implicitly, but we'll
// call it manually to illustrate the point.
ob_end_flush();
?>
result:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo">
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>
the same with saveXML() ...
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo"/>
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>
You probably forgot to set preserveWhiteSpace = false before calling loadHTML()?
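To illustrate the ordering for the XML case, here is a minimal sketch. The function name prettyXml and the sample markup are made up for this example. preserveWhiteSpace is set to false before loadXML(), so whitespace-only text nodes are dropped at parse time and formatOutput can re-indent the tree.

```php
<?php
// Hypothetical helper (not from the original answer): re-indent an XML string.
function prettyXml(string $xml): string
{
    $dom = new DOMDocument();
    // Must be set BEFORE loading, otherwise the whitespace-only text nodes
    // are kept and the serializer leaves the layout alone.
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->loadXML($xml);
    return $dom->saveXML();
}

echo prettyXml("<root>\n<a><b>text</b></a>\n</root>");
```

With libxml's default formatting, nested elements should come out indented by two spaces per level, while elements that contain text (such as <b> here) stay on one line.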
Disclaimer: I stole most of the demo code from Tyson Clugg / the PHP manual comments. Lazy me.
UPDATE: I now remember that some years ago I tried the same thing and ran into the same problem. I fixed it with a dirty workaround (it wasn't performance critical): I converted back and forth between SimpleXML and DOM until the problem vanished. I suppose the conversion got rid of those nodes. Maybe: load with DOM, import with simplexml_import_dom, output the string, parse it with DOM again, and then print it pretty. As far as I remember this worked (but it was really slow).
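The round trip described above could look roughly like the sketch below. It is a reconstruction from the description, not the original workaround: the function name roundTripPretty and the sample input are hypothetical, and it assumes that the string produced by SimpleXML is well-formed XML that loadXML() accepts.

```php
<?php
// Hypothetical helper sketching the DOM -> SimpleXML -> DOM round trip.
function roundTripPretty(string $html): string
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    // 1. Parse the HTML with DOM.
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // 2. Import into SimpleXML and serialize the document to a string.
    $xml = simplexml_import_dom($dom)->asXML();

    // 3. Parse that string again, dropping whitespace-only text nodes.
    $pretty = new DOMDocument();
    $pretty->preserveWhiteSpace = false;
    $pretty->formatOutput = true;
    $pretty->loadXML($xml);

    // 4. Serialize only the root element, with indentation.
    return $pretty->saveXML($pretty->documentElement);
}

echo roundTripPretty('<html><head><title>foo bar</title></head><body><h1>bar foo</h1></body></html>');
```

Note that the final step serializes as XML, so void elements such as meta would come out self-closed, as in the saveXML() output shown earlier.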
The result:
<!DOCTYPE html>
<html>
<head>
<title>My website</title>
</head>
</html>
Please consider:
function indentContent($content, $tab = "\t") {
    // add marker linefeeds between all tag-end boundaries to aid the pretty-tokeniser
    $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content);
    $token = strtok($content, "\n"); // now indent the tags
    $result = '';       // holds formatted version as it is built
    $pad = 0;           // current indent level
    $indent = 0;        // change in indent level to apply after the current line
    $voidTag = false;   // set when the current line opens a void element
    $matches = array(); // filled by preg_match()
    // scan each line and adjust indent based on opening/closing tags
    while ($token !== false && strlen($token) > 0) {
        $token = trim($token);
        // test for the various tag states
        if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) {
            // 1. open and closing tags on same line - no change
            $indent = 0;
        } elseif (preg_match('/^<\/\w/', $token, $matches)) {
            // 2. closing tag - outdent now
            $pad--;
            if ($indent > 0) $indent = 0;
        } elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches)) {
            // 3. opening tag - don't pad this one, only subsequent tags (only if it isn't a void tag)
            foreach ($matches as $m) {
                // Void elements according to http://www.htmlandcsswebdesign.com/articles/voidel.php
                if (preg_match('/^<(area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)/im', $m)) {
                    $voidTag = true;
                    break;
                }
            }
            $indent = 1;
        } else {
            // 4. no indentation needed
            $indent = 0;
        }
        // prefix the line with one $tab per indent level (none if the level is negative)
        $line = str_repeat($tab, max(0, $pad)) . $token;
        $result .= $line . "\n"; // add to the cumulative result, with linefeed
        $token = strtok("\n");   // get the next token
        $pad += $indent;         // update the pad size for subsequent lines
        if ($voidTag) {
            // a void element has no closing tag, so undo the indent it just added
            $voidTag = false;
            $pad--;
        }
    }
    return $result;
}
// $htmldoc is a DOMDocument object
$niceHTMLwithTABS = indentContent($htmldoc->saveHTML(), "\t");
echo $niceHTMLwithTABS;
Will result in HTML that has:
The function (which is a method of a class I use) is largely based on: https://stackoverflow.com/a/7840997/7646824
You can use the code for the hl_tidy function of the htmLawed library.
// indent using one tab per indent, with all HTML being within an imaginary div
$out = hl_tidy($in, 't', 'div');