问题
I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge amount of bugs and fixing them directly in the source is my worst nightmare. Tidy source code is an unreadable abomination. Thousand+ line functions, poor variable naming, spaghetti code etc. It's truly horrible.
Worse yet, official development seems to have ceased. In the last 12 months, there have been three write transactions to the official CVS repo. But it's been dead and buried for much longer than that...
So I'm looking for an OSS C or C++ application/library that can do what Tidy can (when it feels like it): fix bad HTML markup and transform it into valid XHTML (this is the part I'm interested in). And I mean all sorts of bad markup.
Is there something like that out there?
EDIT: I need it both for manipulations on the DOM tree by an XML handling tool and for general compliance with the XHTML spec. My app needs to accept HTML from users (which is often invalid in all sorts of ways) and output valid XHTML. It needs to be able to handle even HTML that would normally not display in a browser because the user edited it by hand and didn't check afterwards.
A drop-in replacement for Tidy's error-correcting parser... that doesn't suck. I don't mind bugs if the source is readable and I can fix problems myself, or if there are active developers who provide bugfixes on a timely basis.
回答1:
Could you tell us what you plan to use this tool for? As in, do you want to fix static web pages, or do you want some sort of filtering step before other manipulations, so that some tool can handle buggy web pages?
Personally, I write my own tool atop Python's BeautifulSoup or lxml whenever I need to --- it's at most a dozen line script and does much of what I want.
回答2:
There is a new, nice, proper HTML 5 supporting Tidy, so the alternative to old, ugly Tidy would be Tidy (GitHub repository).
回答3:
Try Pretty Diff. It is a vastly superior beautification algorithm and it does not make any assumptions about your input.
http://prettydiff.com/?m=beautify&html
回答4:
For something that actually fixes code, your best bet is still HTML Tidy. There are a lot of linters, but not really anything that repairs errors to HTML, other than Tidy.
At first glance, modern OOP programmers might think that the source code is an unreadable abomination, but in the C world, Tidy is pretty sophisticated library that uses a lot of advanced OO concepts and offers a very thoughtful interface that exposes nearly all of its functionality in a pure C API.
A casual developer will be lost, but once immersed, the code is quite beautiful. Granted, naming conventions are a mixed bad, but PR's are welcome!
来源:https://stackoverflow.com/questions/2306980/is-there-an-alternative-to-html-tidy