Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\\\<.*?>", &quo
I know it is been a while since this question as been asked, but I found another solution, this is what worked for me:
Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source= new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(sourceStep.getTextExtractor().toString());
String clearedHtml= m.replaceAll("");
If you're writing for Android you can do this...
android.text.Html.fromHtml(instruction).toString()
You might want to replace <br/>
and </p>
tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.
The only way I can think of removing HTML tags but leaving non-HTML between angle brackets would be check against a list of HTML tags. Something along these lines...
replaceAll("\\<[\s]*tag[^>]*>","")
Then HTML-decode special characters such as &
. The result should not be considered to be sanitized.
To get formateed plain html text you can do that:
String BR_ESCAPED = "<br/>";
Element el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");
To get formateed plain text change <br/> by \n and change last line by:
nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");
Sometimes the html string come from xml with such <
. When using Jsoup we need parse it and then clean it.
Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
While only using Jsoup.parse(htmlstrl).text()
can't remove tags.
Alternatively, one can use HtmlCleaner:
private CharSequence removeHtmlFrom(String html) {
return new HtmlCleaner().clean(html).getText();
}