Is there a way to manipulate partial HTML pages using JSoup?

Question

I am developing a utility that has to traverse a set of HTML files and manipulate them.

JSoup does a wonderful job of parsing and manipulating files that are complete (i.e. they have <html> ... </html> tags).

However, for partial pages, i.e. pages that contain only markup like

<div id="leftnav">...</div>

it parses correctly, but when doc.toString() or doc.outerHtml() is called, it returns a full HTML document (it wraps the partial HTML content in <html> <body> ... </body> </html> tags).

This is a problem for me. Is there an API in JSoup that avoids normalizing the HTML content in this way?

Thanks.

Answer 1:

You can use the XML parser, Parser.xmlParser(). Its documentation describes it as follows:

Create a new XML parser. This parser assumes no knowledge of the incoming tags and does not treat it as HTML, rather creates a simple tree directly from the input.

In other words, it doesn't create the typical HTML structure (html, head, body, etc.) and keeps your input as it is.

Here's how to use it:

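import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Using connect(): fetch a URL and parse the response with the XML parser
Document fetched = Jsoup.connect("<url>").parser(Parser.xmlParser()).get();

// Using parse(): parse an in-memory string with the XML parser
Document parsed = Jsoup.parse("<html>", "<base url>", Parser.xmlParser());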
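To make the difference concrete, here is a minimal sketch that parses the same fragment with the default HTML parser and with the XML parser. The class name PartialHtmlExample, its two helper methods, and the sample markup are illustrative assumptions, not part of the original answer. Exact whitespace in the output depends on the jsoup version, so none is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class: the names here are illustrative only.
public class PartialHtmlExample {

    // Default HTML parser: the fragment is wrapped in html/head/body.
    static String withHtmlParser(String fragment) {
        Document doc = Jsoup.parse(fragment);
        return doc.outerHtml();
    }

    // XML parser: the tree is built directly from the input, so no
    // html/head/body wrapper is added. The fragment can still be
    // queried and modified with the usual selector API.
    static String withXmlParser(String fragment) {
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        doc.select("#leftnav").attr("class", "nav"); // example manipulation
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String fragment = "<div id=\"leftnav\"><a href=\"/home\">Home</a></div>";

        System.out.println("HTML parser:");
        System.out.println(withHtmlParser(fragment));

        System.out.println("XML parser:");
        System.out.println(withXmlParser(fragment));
    }
}
```

One caveat with this approach: the XML parser applies no HTML-specific rules, so it will not treat tags such as <br> or <img> as self-closing the way the HTML parser does. It is probably best suited to fragments that are already well formed.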

Source: https://stackoverflow.com/questions/16473677/is-there-a-way-to-manipulate-partial-html-pages-using-jsoup

Tags

java

html

jsoup
