In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and freq
This has been an excellent start. A few more clarifications/comments:
further thought: I think a good start would be to assume the html is XHTML compliant. I could then infer the schema (using the new .net XmlSchemaInference methods), then diff the schemata. I can then look at the differences and consider whether or not they're significant.
Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.
#! /usr/bin/perl -w
use strict;
undef $/;
my $html = <STDIN>;
while ($html =~ /\S/) {
if ($html =~ s/^\s*<//) {
$html =~ s/^(.*?)>// or die "malformed HTML";
print "<$1>\n";
} else {
$html =~ s/^([^<]+)//;
print "(text)\n";
}
}
Take a look at beyond compare. It has an XML comparison feature that can help you out.
I would use (or contribute to) html5lib
and its SAX output. Just zip through the 2 SAX streams looking for mismatches and highlight the whole corresponding subtree.
@Mike - that would compare everything, including the content of the page, which isn't want the original poster wanted.
Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.
Pretty Diff can do this. It will compare the code structure only regardless of differences to white space, comments, or even content. Just be sure to check the option "Normalize Content and String Literals".
http://prettydiff.com/