Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\\\<.*?>", &quo
HTML escaping is really hard to do right. I'd definitely suggest using library code for this, as it's a lot more subtle than you'd think. Check out StringEscapeUtils from Apache Commons Lang for a pretty good library for handling this in Java.
It sounds like you want to go from HTML to plain text.
If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags from the HTML page found at a URL.
It makes use of org.htmlparser.beans.StringBean.
public static String getUrlContentsAsText(String url) {
    // StringBean fetches the page at the given URL and extracts its text.
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    return stringBean.getStrings();
}
On Android, you can simply use the built-in HTML filter (android.text.Html):
public String htmlToStringFilter(String textToFilter) {
    // fromHtml parses the markup; toString() keeps only the text.
    return Html.fromHtml(textToFilter).toString();
}
The method above returns your input with the HTML markup removed.
My 5 cents:
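// Assumes the intent is to turn "&amp;" back into "&"; splitting on a bare
// "&" and re-joining with "&" would leave the string essentially unchanged.
// The -1 limit keeps trailing empty pieces, so a trailing "&amp;" is handled too.
String[] temp = yourString.split("&amp;", -1);
String tmp = "";
if (temp.length > 1) {
    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    // Drop the extra "&" appended after the last piece.
    yourString = tmp.substring(0, tmp.length() - 1);
}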
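If the aim of the loop above is to turn escaped ampersands back into literal ones, String.replace does the same job in one call and extends naturally to the other common entities. This is a minimal sketch that assumes only the five basic entities matter. The class and method names are made up for illustration, and a library decoder is the safer choice for full entity coverage.

```java
final class EntityDecoder {

    // Decodes the five basic HTML entities with plain (non-regex) replacement.
    // "&amp;" is decoded last so that "&amp;lt;" becomes the text "&lt;"
    // rather than being decoded twice into "<".
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("Tom &amp; Jerry &lt;3")); // Tom & Jerry <3
        System.out.println(decodeBasicEntities("&amp;lt;"));              // &lt;
    }
}
```

The ordering matters: decoding "&amp;" first would turn "&amp;lt;" into "&lt;" and then into "<", which changes the meaning of the text.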
If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-than signs and HTML-encode ampersands (and optionally quotes), and you're fine. A modification to your code to implement the second option would be:
replaceAll("\\<[^>]*>","")
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy, which parses "dirty" HTML input and should give you a way to remove the tags while keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find. So even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you still need to encode any remaining HTML special characters to keep your output safe.
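To make the two options and the final encoding step concrete, here is a rough sketch in plain Java. The class and method names are hypothetical, the escaping covers only the five characters usually treated as special in HTML, and in production code a library escaper is probably the better choice.

```java
final class HtmlSanitizer {

    // Option 1: show the markup as literal text by escaping special characters.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Option 2: drop anything that looks like a tag (the regex from the answer).
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    // Strip first, then escape whatever special characters are left over.
    static String stripThenEscape(String s) {
        return escapeHtml(stripTags(s));
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>hey!</b>"));  // &lt;b&gt;hey!&lt;/b&gt;
        System.out.println(stripTags("<b>hey!</b>"));   // hey!
        // Malformed input: the whole string matches as one "tag" and is lost.
        System.out.println(stripTags("<bhey!</b>").isEmpty()); // true
        System.out.println(stripThenEscape("AT&T <i>rocks</i> > all")); // AT&amp;T rocks &gt; all
    }
}
```

The last call shows why the encoding step is still needed: after the tags are gone, the stray & and > survive and would otherwise reach the page unescaped.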
Here is one more variant that removes all HTML tags, HTML entities, and runs of two or more spaces from HTML content:
content = content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", "");
where content is a String. Strings are immutable, so the result has to be assigned back.
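One caveat with this pattern is that (&.*?;) matches from any ampersand to the next semicolon, so ordinary text can be swallowed. Also, spaces left behind by removed tags are not collapsed, because all three alternatives run in a single pass. The sketch below compares the pattern with a somewhat tighter variant. The tighter variant is an assumption about what is wanted (entities dropped, whitespace collapsed to one space), not the original author's code, and the names are hypothetical.

```java
final class RegexCleaner {

    // The pattern quoted from the answer above, applied in a single pass.
    static String original(String content) {
        return content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", "");
    }

    // A tighter variant: entities must look like &name; or &#123;,
    // and runs of spaces are collapsed to one space after the removals.
    static String tighter(String content) {
        return content
                .replaceAll("<[^>]*>", "")   // tags, including multi-line ones
                .replaceAll("&#?\\w+;", "")  // named or numeric entities only
                .replaceAll(" {2,}", " ")    // collapse runs of spaces
                .trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + original("<p>Fish &amp; chips</p>") + "]"); // [Fish  chips]
        System.out.println("[" + tighter("<p>Fish &amp; chips</p>") + "]");  // [Fish chips]
        System.out.println("[" + original("R&D costs; profit") + "]");       // [R profit]
        System.out.println("[" + tighter("R&D costs; profit") + "]");        // [R&D costs; profit]
    }
}
```

Neither version makes the output safe to embed in a page on its own. Any remaining special characters still need to be encoded.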