In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and freq
The DOM is a data structure - it's a tree.
See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by language grammar and reports deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) that were inserted, deleted, moved, replaced, or had identifiers substituted consistently across them. This tool ignores whitespace reformatting (e.g., different line breaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value). This can be applied to HTML using an HTML parser.
EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.
If I were to tackle this issue, I would do this:
In your example, one side would load only a div element object, while the other side would load a div element object with one child element of type paragraph element. Fire up your iterator: on the first step you match up the two div elements, and on the second step you match the paragraph against nothing. You've got your structural difference.
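A minimal sketch of that walk, in Python with only the standard library. The names `Node`, `TreeBuilder`, `parse`, and `diff` are made up for illustration, and the child-by-position matching is an assumption on my part (a real tool would probably want something smarter when an element is inserted in the middle of a list of siblings):

```python
from html.parser import HTMLParser
from itertools import zip_longest

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class Node:
    """One element of the tree; text content is deliberately not stored."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tag such as <img />: a leaf, nothing to push.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def diff(a, b, path=""):
    """Walk both trees in step and report where the tag structure differs."""
    out = []
    for i, (x, y) in enumerate(zip_longest(a.children, b.children)):
        if x is None:
            out.append(f"{path}/{y.tag}[{i}]: only in second")
        elif y is None:
            out.append(f"{path}/{x.tag}[{i}]: only in first")
        elif x.tag != y.tag:
            out.append(f"{path}/[{i}]: {x.tag} != {y.tag}")
        else:
            out.extend(diff(x, y, f"{path}/{x.tag}[{i}]"))
    return out

print(diff(parse("<div></div>"), parse("<div><p>text</p></div>")))
# ['/div[0]/p[0]: only in second']
```

Because text nodes are never stored, two pages that differ only in their wording come back with an empty list.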
You may also have to consider that the 'content' itself could contain additional mark-up, so it's probably worth stripping out everything within certain elements (like <div>s with certain IDs or classes) before you do your comparison. For example:
<div id="mainContent">
<p>lorem ipsum etc..</p>
</div>
and
<div id="mainContent">
<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>
<ul>
<li>and</li>
<li>some</li>
<li>more..</li>
</ul>
</div>
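One way the stripping could look, as a rough Python sketch using only the standard library. `SkeletonParser` and `skeleton` are hypothetical names, matching is done on IDs only (classes would work the same way), and it assumes the markup is well-formed, i.e. every non-void tag inside an ignored element is properly closed:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class SkeletonParser(HTMLParser):
    """Records the tag sequence, skipping everything inside ignored elements."""
    def __init__(self, ignore_ids):
        super().__init__()
        self.ignore_ids = set(ignore_ids)
        self.skeleton = []
        self.skip_depth = 0  # > 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        element_id = dict(attrs).get("id")
        self.skeleton.append(f"<{tag}#{element_id}>" if element_id else f"<{tag}>")
        if element_id in self.ignore_ids and tag not in VOID_TAGS:
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.skeleton.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            if self.skip_depth:
                return  # still inside the ignored element
        self.skeleton.append(f"</{tag}>")

def skeleton(html, ignore_ids):
    parser = SkeletonParser(ignore_ids)
    parser.feed(html)
    parser.close()
    return parser.skeleton

template = '<div id="mainContent">\n<p>lorem ipsum etc..</p>\n</div>'
real = ('<div id="mainContent">\n'
        '<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>\n'
        '<ul>\n<li>and</li>\n<li>some</li>\n<li>more..</li>\n</ul>\n</div>')

# With mainContent ignored, both snippets reduce to the same skeleton.
print(skeleton(template, {"mainContent"}) == skeleton(real, {"mainContent"}))  # True
```

With an empty ignore set the two snippets above produce different skeletons, so the comparison still catches structural changes everywhere outside the content areas.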