Remove HTML tags from a String

误落风尘 2020-11-21 07:35

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\\\<.*?>", &quo         


        
30 Answers
  • 2020-11-21 07:56

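    I know it has been a while since this question was asked, but I found another solution that worked for me. Source here appears to be the Jericho HTML Parser class (net.htmlparser.jericho.Source):

    // Extract the text with the parser, then strip anything that still looks like a tag.
    Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
    Source source = new Source(htmlAsString);
    Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
    String clearedHtml = m.replaceAll("");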
    
  • 2020-11-21 07:58

    If you're writing for Android you can do this...

    android.text.Html.fromHtml(instruction).toString()
    
  • 2020-11-21 07:59

    You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.

    The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets is to check against a list of known HTML tags, with "tag" below standing for each name in that list. Something along these lines...

    replaceAll("<\\s*tag[^>]*>", "")
    

    Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
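
    As a rough illustration of the steps above (newlines for <br/> and </p>, a check against a tag list, then entity decoding), here is a minimal JDK-only sketch. The class name KnownTagStripper, the short tag list, and the handful of entities it decodes are assumptions made for the example, and the output is still not sanitized.

```java
import java.util.regex.Pattern;

// Sketch only: strips tags whose names are in a known list and leaves other
// text between angle brackets (such as "<not a tag>") untouched.
class KnownTagStripper {
    // Hypothetical, deliberately short list; a real one would cover far more tags.
    private static final String[] TAGS = {"p", "br", "b", "i", "a", "div", "span"};

    // <br>, <br/>, <br /> and </p> become newlines so paragraphs stay readable.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("<\\s*(?:br\\s*/?|/\\s*p)\\s*>", Pattern.CASE_INSENSITIVE);

    // Opening or closing form of any listed tag, with optional attributes.
    private static final Pattern KNOWN_TAGS = Pattern.compile(
            "<\\s*/?\\s*(?:" + String.join("|", TAGS) + ")\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String text = LINE_BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // "&amp;" is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p><p>a &lt; b <not a tag></p>"));
    }
}
```

    Building the pattern from the list keeps the tag names in one place; anything not listed, including unknown or misspelled tags, passes through unchanged.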

  • 2020-11-21 07:59

    To get formatted plain HTML text, you can do this:

    String BR_ESCAPED = "&lt;br/&gt;";
    Elements el=Jsoup.parse(html).select("body"); // select() returns Elements, not a single Element
    el.select("br").append(BR_ESCAPED);
    el.select("p").append(BR_ESCAPED+BR_ESCAPED);
    el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
    el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
    el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
    el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
    el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
    String nodeValue=el.text();
    nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
    nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
    

    To get formatted plain text instead, replace <br/> with \n and change the last line to:

    nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");
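
    To see what that last regex does on its own, here is a small JDK-only sketch. The class name BlankLineCollapser is hypothetical, and the sketch assumes the plain-text variant should collapse any run of three or more line breaks (with optional whitespace between them) into exactly two newlines.

```java
// Sketch only: collapses runs of three or more line breaks into one blank line.
class BlankLineCollapser {
    static String collapse(String text) {
        // (\s*\n){3,} matches at least three newlines, each optionally
        // preceded by whitespace; the whole run is replaced by "\n\n".
        return text.replaceAll("(\\s*\n){3,}", "\n\n");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Title\n\n\n\n\nBody"));
    }
}
```

    Runs of one or two newlines are left alone, so ordinary paragraph breaks survive.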
    
  • 2020-11-21 07:59

    Sometimes the HTML string comes from XML, where the markup is itself escaped with entities such as &lt;. With Jsoup we then need to parse it first and clean the result.

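    // First pass: parsing decodes the escaped markup into literal tag text.
    Document doc = Jsoup.parse(htmlstrl);
    // Second pass: an empty whitelist strips every tag (newer jsoup releases call this class Safelist).
    Whitelist wl = Whitelist.none();
    String plain = Jsoup.clean(doc.text(), wl);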
    

    Using only Jsoup.parse(htmlstrl).text() does not remove the tags in this case, because the escaped markup is merely decoded into literal tag text.
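
    The need for two passes can be illustrated without Jsoup. The sketch below is a crude JDK-only stand-in, not how Jsoup works internally; the class EscapedHtmlDemo and its textOf method are hypothetical names, and only three entities are decoded.

```java
// Sketch only: shows why entity-escaped markup needs two passes.
class EscapedHtmlDemo {
    // Crude stand-in for text extraction: drop anything that looks like a tag,
    // then decode three common entities ("&amp;" last to avoid double decoding).
    static String textOf(String html) {
        String noTags = html.replaceAll("<[^>]+>", "");
        return noTags.replace("&lt;", "<")
                     .replace("&gt;", ">")
                     .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = "&lt;b&gt;bold&lt;/b&gt; text";
        String once = textOf(escaped);   // tags reappear as literal text
        String twice = textOf(once);     // second pass removes them
        System.out.println(once);
        System.out.println(twice);
    }
}
```

    The first pass finds no tags to strip and only decodes the entities, which is why a second pass over the decoded text is needed.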

  • 2020-11-21 08:01

    Alternatively, one can use HtmlCleaner:

    private CharSequence removeHtmlFrom(String html) {
        return new HtmlCleaner().clean(html).getText();
    }
    