Extract and Clean HTML Fragment using HTML Parser (org.htmlparser)

Submitted by 拜拜、爱过 on 2019-12-01 02:46:56

Question


I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.

The operations required are:

  1. Remove all tags that have a class of "hidden"
  2. Remove all script tags
  3. Remove all style tags
  4. Remove all event attributes (on*="*")
  5. Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements; however, I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.

Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.

Thanks in advance!


Answer 1:


Check out jsoup - it should handle all of your necessary tasks in an elegant way.

[Edit]

Here's a full working example per your required operations:

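import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Load and parse the document (Jsoup.parse(File, ...) throws IOException).
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(String)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
for (Element el : doc.select("*")) {
  // Collect the keys first: removing while iterating el.attributes() is unsafe.
  List<String> toRemove = new ArrayList<>();
  for (Attribute attr : el.attributes()) {
    String attrKey = attr.getKey();
    // Compare case-insensitively so that e.g. "onClick" and "STYLE" also match.
    if (attrKey.equalsIgnoreCase("style") || attrKey.regionMatches(true, 0, "on", 0, 2)) {
      toRemove.add(attrKey);
    }
  }
  for (String key : toRemove) {
    el.removeAttr(key);
  }
}
// See also - doc.select("*").removeAttr("style");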

You'll want to make sure that attribute names are matched case-insensitively (for example, onClick as well as onclick), but this should cover the majority of what you need.
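
As a minimal sketch of the case-insensitivity point, the attribute-name check can be pulled out into a small helper that uses only the standard library. The class name AttributeFilter and the methods shouldStrip and keysToStrip are hypothetical names chosen for illustration; they are not part of jsoup. The "on" prefix test mirrors the one in the answer and is an assumption about what counts as an event attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of jsoup): decides which attribute names to strip.
public class AttributeFilter {

    // True for "style" and for names starting with "on" (e.g. "onclick"), ignoring case.
    public static boolean shouldStrip(String attrName) {
        String key = attrName.toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }

    // Returns the names to remove, in their original spelling, without modifying the input.
    public static List<String> keysToStrip(Iterable<String> attrNames) {
        List<String> result = new ArrayList<>();
        for (String name : attrNames) {
            if (shouldStrip(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("href", "onClick", "STYLE", "class", "onload");
        System.out.println(keysToStrip(names)); // prints [onClick, STYLE, onload]
    }
}
```

Returning the original spelling of each name matters because the key passed to a removal call generally has to match the stored key. Collecting the names into a separate list before removing them also avoids modifying an attribute collection while it is being iterated.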



Source: https://stackoverflow.com/questions/8357855/extract-and-clean-html-fragment-using-html-parser-org-htmlparser
