I have been using PHP's DOM to load an HTML template, modify it and output it. Recently I discovered that self-closing (empty) tags don't include a closing slash, even tho
DOMDocument->saveHTML() takes your XML DOM infoset and writes it out as old-school HTML, not XML. You should not use saveHTML() together with an XHTML doctype, as its output won't be well-formed XML.
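To see the difference for yourself, here is a minimal sketch that serialises the same node with both saveHTML() and saveXML(). The markup and variable names are made up for illustration.

```php
<?php
// Illustrative markup only; any document containing an empty element will do.
libxml_use_internal_errors(true); // keep HTML parser warnings quiet

$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><html><body><p>Line one<br>line two</p></body></html>');
$p = $doc->getElementsByTagName('p')->item(0);

// Same node, two serialisers.
$asHtml = $doc->saveHTML($p); // HTML rules: the empty element is written as <br>
$asXml  = $doc->saveXML($p);  // XML rules: the empty element is written as <br/>

echo $asHtml, "\n", $asXml, "\n";
```

Passing a node to saveHTML() needs PHP 5.3.6 or later; on older versions you would have to serialise the whole document instead.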
If you use saveXML() instead, you'll get proper XHTML. It's fine to serve this XML output to standards-compliant browsers if you give it a Content-Type: application/xhtml+xml header. But unfortunately IE6-8 won't be able to read that, as they can still only handle old-school HTML, under the text/html media type.
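If you did want to send the XML media type only to browsers that ask for it, one possible sketch is below. pickMediaType() is a hypothetical helper, not part of PHP, and it assumes that a browser able to parse application/xhtml+xml lists it in its Accept header. It is a plain substring test and ignores q-values, so treat it as a starting point rather than proper content negotiation.

```php
<?php
/**
 * Hypothetical helper: pick a media type from the request's Accept header.
 * Simplification: substring match only, q-values are not parsed.
 */
function pickMediaType(string $accept): string
{
    return stripos($accept, 'application/xhtml+xml') !== false
        ? 'application/xhtml+xml'
        : 'text/html';
}

// HTTP_ACCEPT is absent on the command line, so fall back to an empty string.
$accept = $_SERVER['HTTP_ACCEPT'] ?? '';
$type   = pickMediaType($accept);

if (!headers_sent()) {
    header('Content-Type: ' . $type . '; charset=utf-8');
}
```

Whatever you send under text/html still has to survive an HTML parser, so this does not remove the need for HTML-compatible output.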
The usual compromise solution is to serve text/html and use ‘HTML-compatible XHTML’ as outlined in Appendix C of the XHTML 1.0 spec. But sadly there is no PHP DOMDocument->saveXHTML() method to generate the correct output for this.
There are some things you can do to persuade saveXML() to produce HTML-compatible output for some common cases. The main one is that you have to ensure that only elements defined by HTML4 as having an EMPTY content model (<img>, <br> etc.) actually do have empty content, causing the self-closing syntax (<img/>) to be used. Other elements must not use the self-closing syntax, so if they're empty you should put a space in their text content to stop them being so:
<script src="x.js"/> <-- no good, confuses HTML parser and breaks page
<script src="x.js"> </script> <-- fine
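The padding can be automated before calling saveXML(). This is a rough sketch: padEmptyElements() and the HTML4_EMPTY constant are names invented here, and the list is the set of elements HTML 4.01 declares with an EMPTY content model.

```php
<?php
// Elements that HTML 4.01 defines with an EMPTY content model.
const HTML4_EMPTY = ['area', 'base', 'basefont', 'br', 'col', 'frame', 'hr',
                     'img', 'input', 'isindex', 'link', 'meta', 'param'];

/**
 * Hypothetical helper: give every childless element that is NOT in the
 * EMPTY list a single space, so saveXML() writes <x> </x> instead of <x/>.
 */
function padEmptyElements(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    // XPath results are a snapshot, so appending nodes while looping is safe.
    foreach ($xpath->query('//*[not(node())]') as $el) {
        if (!in_array(strtolower($el->nodeName), HTML4_EMPTY, true)) {
            $el->appendChild($doc->createTextNode(' '));
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<div><script src="x.js"/><br/><p></p></div>');
padEmptyElements($doc);

$out = $doc->saveXML($doc->documentElement);
echo $out, "\n"; // <div><script src="x.js"> </script><br/><p> </p></div>
```

Note that this compares nodeName, so it assumes unprefixed element names; a document that uses a namespace prefix for XHTML would need to compare localName instead.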
The other one to look out for is handling of the inline <script> and <style> elements, which are normal elements in XHTML but special CDATA-content elements in HTML. Some /*<![CDATA[*/.../*]]>*/ wrapping is required to make any < or & characters inside them behave mostly-consistently, though note you still have to avoid the ]]> and </ sequences.
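One way to apply that wrapping through the DOM is sketched below. wrapInlineCode() is a made-up helper name; it swaps each inline script or style body for a comment-guarded CDATA section and refuses content containing the two sequences that cannot be made safe. It is not idempotent, so run it once, just before serialising.

```php
<?php
/**
 * Hypothetical helper: wrap inline <script>/<style> content in
 * a comment-guarded CDATA section so it reads the same as HTML and as XML.
 */
function wrapInlineCode(DOMDocument $doc): void
{
    foreach (['script', 'style'] as $tag) {
        // getElementsByTagName() is live, so copy it before editing the tree.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $el) {
            $code = $el->textContent;
            if (trim($code) === '') {
                continue; // external or empty element, nothing to wrap
            }
            if (strpos($code, ']]>') !== false || strpos($code, '</') !== false) {
                throw new InvalidArgumentException("Unsafe sequence in <$tag> content");
            }
            while ($el->firstChild) {
                $el->removeChild($el->firstChild);
            }
            // Serialises as: /*<![CDATA[*/ ...code... /*]]>*/
            $el->appendChild($doc->createTextNode('/*'));
            $el->appendChild($doc->createCDATASection("*/\n" . $code . "\n/*"));
            $el->appendChild($doc->createTextNode('*/'));
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<body><script>if (a &lt; b &amp;&amp; c) { go(); }</script></body>');
wrapInlineCode($doc);

$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
```

The /* */ comment syntax is valid in both JavaScript and CSS, which is why the same wrapper works for both elements.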
If you want to really do it properly you would have to write your own HTML-compatible-XHTML serialiser. Long-term that would probably be a better option. But for small simple cases, hacking your input so that it doesn't contain anything that would come out the other end of an XML serialiser as incompatible with HTML is probably the quick solution.
That or just suck it up and live with old-school non-XML HTML, obviously.
This is an old question, but...
As others have stated, PHP's DOM leaves much to be desired...
Here's a regex to close "void" tags, if you so desire:
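// Void elements: tags that never have content or a closing tag.
$voidTags = array('area','base','br','col','command','embed','hr','img','input','keygen','link','meta','param','source','track','wbr');
// Capture the tag name and its attributes; swallow any existing " /" so <br /> is not doubled.
// Case-insensitive. Assumes no attribute value contains a literal ">".
$regEx = '#<('.implode('|', $voidTags).')\b([^>]*?)\s*/?>#i';
$html = preg_replace($regEx, '<\\1\\2 />', $html);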
On the doctype issue: since the document is text/html, the closing slash isn't needed. You only need the closing slash in an XHTML document.
I see you've updated the question to add the doctype, but PHP's DOM also looks at that meta tag you've got in there, and content="text/html; charset=utf-8" clearly isn't XML-based, it's just text/html :)
Aside: the DOM API also picks up the charset from there.