How to remove only html tags from text with Jsoup?

我们两清 submitted on 2020-01-04 11:00:36

Question


I want to remove ONLY the HTML tags from a text with Jsoup. I used the solution from my previous question about Jsoup, but after some testing I found that it fails with a Java heap error (OutOfMemoryError) for some large HTML inputs, though not for all of them. For example, it fails on a 2 MB HTML document with 10,000 lines. The exception is thrown on the last line (NOT in Jsoup.parse):

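public String StripHtml(String html) {
  // Turn escaped angle brackets back into real ones before parsing
  html = html.replace("&lt;", "<").replace("&gt;", ">");
  // getAllStandardHtmlTags() is the asker's own helper (not shown here)
  String[] tags = getAllStandardHtmlTags();
  Document thing = Jsoup.parse(html);
  for (String tag : tags) {
      for (Element elem : thing.getElementsByTag(tag)) {
          // Move the element's children up to its parent, then drop the element itself
          elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
          elem.remove();
      }
  }
  // The OutOfMemoryError is thrown here, while serializing the document
  return thing.html();
}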

Is there a way to fix it?


Answer 1:


Alternatively, you can try Jsoup's cleaning capabilities. The code below removes ALL HTML tags from the HTML string passed in.

public String StripHtml(String html) {
    return Jsoup.clean(html, Whitelist.none());
}

The whitelist (Whitelist.none()) tells the Jsoup cleaner which tags are allowed. Here, no HTML tags are allowed. Any tag not listed in the whitelist is removed, while its text content is kept.

You may also be interested in the other provided whitelists:

  • Whitelist.basic()
  • Whitelist.basicWithImages()
  • Whitelist.none()
  • Whitelist.relaxed()
  • Whitelist.simpleText()

These base whitelists can be customized by adding tags (see the addTags method) or by removing tags (see the removeTags method).

If you want to create your own whitelist (be careful!), here is how:

Whitelist myCustomWhitelist = new Whitelist();
myCustomWhitelist.addTags("b", "em", ...);
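
As a rough sketch of the two approaches above, the following compares Whitelist.none() with a customized whitelist. The class name WhitelistDemo, the method names, and the sample input are made up for illustration, and the code assumes a jsoup version that still ships org.jsoup.safety.Whitelist (newer jsoup releases rename the class to Safelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class; requires jsoup on the classpath.
class WhitelistDemo {

    // Removes every tag and keeps only the text content.
    static String stripAllTags(String html) {
        return Jsoup.clean(html, Whitelist.none());
    }

    // Starts from the empty whitelist and allows only <b> and <em>.
    static String keepBoldAndEmphasis(String html) {
        Whitelist custom = Whitelist.none().addTags("b", "em");
        return Jsoup.clean(html, custom);
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b> and <i>italic</i></p>";
        System.out.println(stripAllTags(html));
        System.out.println(keepBoldAndEmphasis(html));
    }
}
```

Note that Jsoup.clean returns HTML, not plain text, so characters such as & and < in the remaining text come back escaped as entities.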

See details here: Jsoup Whitelists

Jsoup 1.8.3




Answer 2:


After a lot of searching on Google and several attempts to implement an HTML stripper myself, my solution is to use Solr's HTMLStripCharFilter class, modified so that escapedTags is replaced by a blacklist containing the standard HTML tags.

  1. HTMLStripCharFilter is faster than the Jsoup library and than regexes for large files
  2. HTMLStripCharFilter doesn't have Jsoup's memory problem (OutOfMemoryError) with large files
  3. HTMLStripCharFilter doesn't run into "catastrophic backtracking" the way regexes can



Answer 3:


I see two solutions:

  1. Increase the Java heap space. It seems that generating the HTML as a string needs more memory than you allow. The maximum Java heap size can be raised with the -Xmx command-line option of the JVM:

    java -Xmx512m parsing.java

  2. You could switch from the DOM-based Jsoup to a SAX-based parser such as nekohtml. Such parsers can handle HTML documents of any size because they never build the complete DOM in memory.
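
To illustrate the streaming idea in point 2 (and the blacklist of standard tags mentioned in Answer 2), here is a minimal sketch that uses only the JDK. It is NOT nekohtml or HTMLStripCharFilter, just a hand-written character-by-character filter. The class name StreamingTagStripper, the KNOWN_TAGS subset and MAX_TAG_LENGTH are assumptions made up for this example. It does not handle comments, script content, entities, or a '>' inside an attribute value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: strips known HTML tags from a stream without building a DOM.
class StreamingTagStripper {

    // Illustrative subset only; a real list would cover all standard HTML tags.
    static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "html", "head", "body", "p", "div", "span", "b", "i", "em", "strong", "a", "br"));

    // Upper bound on a buffered tag candidate, so memory use stays bounded.
    static final int MAX_TAG_LENGTH = 4096;

    static void strip(Reader in, Writer out, Set<String> knownTags) throws IOException {
        StringBuilder candidate = new StringBuilder(); // text from the last open '<' onward
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (candidate.length() == 0) {
                if (ch == '<') {
                    candidate.append(ch); // possible start of a tag
                } else {
                    out.write(ch);        // ordinary text passes straight through
                }
            } else if (ch == '<') {
                // The previous '<' was never closed: emit it as text, start a new candidate.
                out.append(candidate);
                candidate.setLength(0);
                candidate.append(ch);
            } else if (ch == '>') {
                candidate.append(ch);
                // Drop the candidate only if its name is a known tag; otherwise keep it as text.
                if (!knownTags.contains(tagName(candidate))) {
                    out.append(candidate);
                }
                candidate.setLength(0);
            } else {
                candidate.append(ch);
                if (candidate.length() > MAX_TAG_LENGTH) {
                    out.append(candidate); // too long to be a tag: treat as text
                    candidate.setLength(0);
                }
            }
        }
        out.append(candidate); // an unterminated "<..." at end of input is plain text
    }

    // Extracts the lower-cased name from a candidate such as "<p class='x'>" or "</b>".
    static String tagName(CharSequence candidate) {
        int start = 1; // skip '<'
        if (start < candidate.length() && candidate.charAt(start) == '/') {
            start++;   // closing tag
        }
        int end = start;
        while (end < candidate.length() && Character.isLetterOrDigit(candidate.charAt(end))) {
            end++;
        }
        return candidate.subSequence(start, end).toString().toLowerCase(Locale.ROOT);
    }

    // Convenience wrapper for small in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out, KNOWN_TAGS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("if a < b and <foo> then <br/>done"));
    }
}
```

For a large file, the Reader/Writer overload would typically be called with a BufferedReader and BufferedWriter, so that neither the input nor the output is ever held in memory as a whole.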




Answer 4:


For me, it was sufficient to combine two Jsoup methods:

Jsoup.clean(Jsoup.parse(htmlString).text(), Whitelist.simpleText()) 

You may choose whichever whitelist suits you.



Source: https://stackoverflow.com/questions/34563702/how-to-remove-only-html-tags-from-text-with-jsoup
