Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\\\<.*?>", &quo
I think that the simpliest way to filter the html tags is:
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
public static String removeTags(String string) {
if (string == null || string.length() == 0) {
return string;
}
Matcher m = REMOVE_TAGS.matcher(string);
return m.replaceAll("");
}
Use Html.fromHtml
HTML Tags are
<a href=”…”> <b>, <big>, <blockquote>, <br>, <cite>, <dfn>
<div align=”…”>, <em>, <font size=”…” color=”…” face=”…”>
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>
<i>, <p>, <small>
<strike>, <strong>, <sub>, <sup>, <tt>, <u>
As per Android’s official Documentations any tags in the HTML will display as a generic replacement String which your program can then go through and replace with real strings.
Html.formHtml
method takes an Html.TagHandler
and an Html.ImageGetter as arguments as well as the text to parse.
String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
Then
Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
Output
This is about me text that the user can put into their profile
classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()
I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:
// delete all comments
response = response.replaceAll("<!--[^>]*-->", "");
// delete all script elements
response = response.replaceAll("<(script|SCRIPT)[^+]*?>[^>]*?<(/script|SCRIPT)>", "");
The accepted answer of doing simply Jsoup.parse(html).text()
has 2 potential issues (with JSoup 1.7.3):
<script>
into <script>
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
// breaks multi-level of escaping, preventing &lt;script&gt; to be rendered as <script>
String replace = input.replace("&", "");
// decode any encoded html, preventing <script> to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);
Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.
And here is a bunch of test cases (input to output):
{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"<script>", ""},
{"&lt;script&gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
If you find a way to make it better, please let me know.
Here is another way to do it:
public static String removeHTML(String input) {
int i = 0;
String[] str = input.split("");
String s = "";
boolean inTag = false;
for (i = input.indexOf("<"); i < input.indexOf(">"); i++) {
inTag = true;
}
if (!inTag) {
for (i = 0; i < str.length; i++) {
s = s + str[i];
}
}
return s;
}