Two other options are HTMLCleaner and HTMLParser.
I have tried most of the parsers mentioned here in a crawler / data extraction framework I have been developing. I use HTMLCleaner for the bulk of the data extraction work because it supports a reasonably modern dialect of HTML (including XHTML and HTML 5) with namespaces, and because it can produce a standard DOM, which makes it possible to query documents with Java's built-in XPath implementation.
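For illustration, here is a minimal sketch of that pipeline (the class name and sample HTML are hypothetical; it assumes HTMLCleaner's `DomSerializer` is on the classpath):

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

public class CleanerXPathExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div><a href='http://example.com'>link</a></div></body></html>";

        // Clean the (possibly malformed) HTML into a well-formed tree.
        CleanerProperties props = new CleanerProperties();
        TagNode root = new HtmlCleaner(props).clean(html);

        // Serialize to a standard org.w3c.dom.Document ...
        Document doc = new DomSerializer(props).createDOM(root);

        // ... and query it with the JDK's built-in XPath engine.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}
```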
It's a lot easier to do this with HTMLCleaner than with some of the other parsers: JSoup, for example, exposes a DOM-like interface rather than an actual DOM, so some assembly is required. Jericho has a SAX-like interface, so again it requires some work; Sujit Pal has a good description of how to do this, but in the end HTMLCleaner just worked better.
I also use HTMLParser and Jericho for a table extraction task, which replaced some code written with Perl's libhtml-tableextract-perl. I use HTMLParser to filter the HTML down to the table, then use Jericho to parse it. I agree with MJB's and Adam's comments that Jericho is good in some cases because it preserves the underlying HTML. It has a kind of non-standard SAX interface, though, so for XPath processing HTMLCleaner is better; a sketch of the two-stage approach follows.
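Roughly, the two stages look like this (a hedged sketch, not the exact code from my framework; the sample HTML and the row/cell walk are illustrative):

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class TableExtractionExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>noise</p>"
                + "<table><tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr></table></body></html>";

        // Stage 1: HTMLParser filters the page down to just the table markup.
        Parser parser = Parser.createParser(html, "UTF-8");
        String tableHtml = parser
                .extractAllNodesThatMatch(new TagNameFilter("table"))
                .toHtml();

        // Stage 2: Jericho walks the rows and cells of the filtered fragment.
        Source source = new Source(tableHtml);
        for (Element row : source.getAllElements(HTMLElementName.TR)) {
            StringBuilder line = new StringBuilder();
            for (Element cell : row.getAllElements(HTMLElementName.TD)) {
                line.append(cell.getTextExtractor().toString()).append('\t');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```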
Parsing HTML in Java is a surprisingly hard problem as all the parsers seem to struggle on certain types of malformed HTML content.