Question
I am developing a utility that has to traverse a set of HTML files and manipulate them.
JSoup does a wonderful job of parsing and manipulating complete files (i.e. files that have <html> ... </html> tags).
However, for partial pages, i.e. pages that contain only markup like
<div id="leftnav">...</div>
it parses correctly, but when doc.toString() or doc.outerHtml() is called, it returns a full HTML document (it wraps the partial HTML content in <html> <body> ... </body> </html> tags).
This is a problem for me. Is there an API in JSoup that avoids sanitizing / cleaning the HTML content in this manner?
Thanks.
Answer 1:
You can use the XML parser, Parser.xmlParser(). From its documentation:
Create a new XML parser. This parser assumes no knowledge of the incoming tags and does not treat it as HTML, rather creates a simple tree directly from the input.
In other words, it doesn't create the typical HTML structure (html, head, body, etc.) and takes your input as is.
Here's how to use it:
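import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
// Using connect(): fetch a URL and parse the response with the XML parser
Document fetched = Jsoup.connect("<url>").parser(Parser.xmlParser()).get();
// Using parse(): parse markup you already hold as a string ("<html>" and "<base url>" are placeholders)
Document parsed = Jsoup.parse("<html>", "<base url>", Parser.xmlParser());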
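To make the difference concrete, here is a minimal sketch that parses the same fragment with the default HTML parser and with the XML parser, then edits the XML-parsed tree. The class name PartialHtmlExample, the helper parseFragment, the sample markup, and the "sidebar" class value are all made up for illustration. Turning pretty-printing off is an assumption about what you want: it keeps the output layout close to the input.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class PartialHtmlExample {

    // Hypothetical helper: parse a fragment with the XML parser so that
    // no <html>/<head>/<body> wrapper is added around it.
    static Document parseFragment(String fragment) {
        // Empty base URI: the fragment comes from a string, not a fetched page.
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        // Disable pretty-printing so the serializer does not re-indent the markup.
        doc.outputSettings().prettyPrint(false);
        return doc;
    }

    public static void main(String[] args) {
        String fragment = "<div id=\"leftnav\"><a href=\"/home\">Home</a></div>";

        // Default HTML parser: the fragment is wrapped in a full document.
        Document full = Jsoup.parse(fragment);
        System.out.println("HTML parser adds <body>: " + full.outerHtml().contains("<body>"));

        // XML parser: the tree holds only what was in the input.
        Document partial = parseFragment(fragment);
        // Manipulate the fragment as usual, here by setting a class attribute.
        partial.select("#leftnav").attr("class", "sidebar");
        System.out.println(partial.outerHtml());
    }
}
```

One thing to check on your own files: because the XML parser does not apply HTML rules, markup that relies on them (for example void elements such as <br> written without a closing slash) may be parsed or serialized differently than with the HTML parser.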
Source: https://stackoverflow.com/questions/16473677/is-there-a-way-to-manipulate-partial-html-pages-using-jsoup