I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.
The operations required are:
- Remove all tags that have a class of "hidden"
- Remove all script tags
- Remove all style tags
- Remove all event attributes (on*="*")
- Remove all style attributes
I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements; however, I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.
Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.
Thanks in advance!
Check out jsoup - it should handle all of your necessary tasks in an elegant way.
[Edit]
Here's a full working example per your required operations:
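import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Load and parse the document (Jsoup.parse(File, ...) throws IOException).
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(String)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");
// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();
// Remove all style and event-handler attributes from all elements.
for (Element el : doc.select("*")) {
    // Collect matching keys first: removing while iterating el.attributes()
    // may skip entries or throw, depending on the jsoup version.
    List<String> toRemove = new ArrayList<>();
    for (Attribute attr : el.attributes()) {
        String attrKey = attr.getKey();
        if (attrKey.equalsIgnoreCase("style") || attrKey.regionMatches(true, 0, "on", 0, 2)) {
            toRemove.add(attrKey);
        }
    }
    for (String key : toRemove) {
        el.removeAttr(key);
    }
}
// See also - doc.select("*").removeAttr("style");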
Make sure the attribute-name checks do not depend on case; otherwise, this should cover most of what you need.
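One way to keep the case handling in a single place is to pull the check into a small helper. This is only a sketch: AttributeFilter and shouldStrip are hypothetical names of my own, not part of jsoup, and the helper uses nothing beyond the standard library. It matches any attribute whose name starts with "on", just like the prefix check in the example above.

```java
import java.util.Locale;

// Hypothetical helper (not part of jsoup): decides whether an attribute
// should be stripped, ignoring the case of the attribute name.
final class AttributeFilter {
    private AttributeFilter() {}

    static boolean shouldStrip(String attrKey) {
        // Locale.ROOT keeps the lowercasing independent of the default locale.
        String key = attrKey.toLowerCase(Locale.ROOT);
        // "style" must match exactly; event handlers are matched by the "on" prefix.
        return key.equals("style") || key.startsWith("on");
    }

    public static void main(String[] args) {
        // Print the decision for a few sample attribute names.
        for (String key : new String[] {"onClick", "STYLE", "href"}) {
            System.out.println(key + " -> " + shouldStrip(key));
        }
    }
}
```

Inside the attribute loop you would then call AttributeFilter.shouldStrip(attr.getKey()) instead of writing the comparison inline.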
Source: https://stackoverflow.com/questions/8357855/extract-and-clean-html-fragment-using-html-parser-org-htmlparser