Remove HTML tags from a String

误落风尘 2020-11-21 07:35

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\\\<.*?>", "…
30 Answers
  • 2020-11-21 07:38

    I think the simplest way to filter out HTML tags is:

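    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Matches anything between '<' and the next '>' on the same line.
    private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
    
    public static String removeTags(String string) {
        if (string == null || string.isEmpty()) {
            return string;
        }
    
        Matcher m = REMOVE_TAGS.matcher(string);
        return m.replaceAll("");
    }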
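
To see what this pattern does and where it falls short, here is a small self-contained sketch. The class name `RemoveTagsDemo` and the sample strings are illustrative assumptions, not part of the answer above.

```java
import java.util.regex.Pattern;

public class RemoveTagsDemo {
    // Same pattern as in the answer: '<', one or more characters, then the next '>'.
    private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

    public static String removeTags(String string) {
        if (string == null || string.isEmpty()) {
            return string;
        }
        return REMOVE_TAGS.matcher(string).replaceAll("");
    }

    public static void main(String[] args) {
        // Well-formed markup: tags are dropped, text is kept.
        System.out.println(removeTags("<p>Hello <b>world</b></p>")); // Hello world
        // Plain text that happens to contain angle brackets is eaten as well.
        System.out.println(removeTags("if a < b and c > d then"));   // if a  d then
        // Character entities are left untouched.
        System.out.println(removeTags("Tom &amp; Jerry"));           // Tom &amp; Jerry
    }
}
```

The second and third cases are the usual reasons a plain regex is not enough when the input is arbitrary HTML rather than trusted, simple markup.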
    
  • 2020-11-21 07:38

    Use Html.fromHtml

    The HTML tags it supports are:

    <a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
    <div align="…">, <em>, <font size="…" color="…" face="…">,
    <h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
    <i>, <p>, <small>,
    <strike>, <strong>, <sub>, <sup>, <tt>, <u>
    

    According to Android's official documentation, any <img> tags in the HTML will display as a generic replacement image, which your program can then go through and replace with real images.

    An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments, in addition to the text to parse.

    Example

    String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";
    

    Then

    Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());
    

    Output

    This is about me text that the user can put into their profile

  • 2020-11-21 07:39
    classeString.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
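
The one-liner above replaces each tag with a space and then collapses runs of whitespace, which keeps words from adjacent elements apart. A minimal sketch of that effect follows; the class name `TagsToSpacesDemo`, the method name `stripTags`, and the sample input are assumptions made for illustration, and the tag pattern is written in an equivalent form without the unnecessary escapes.

```java
public class TagsToSpacesDemo {
    // Replace every tag with a space, collapse whitespace runs, trim the ends.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>";
        // Tags become spaces, so the two paragraphs stay separate words.
        System.out.println(stripTags(html));                  // one two
        // Replacing tags with the empty string glues the words together.
        System.out.println(html.replaceAll("<[^>]+>", ""));   // onetwo
    }
}
```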
    
  • 2020-11-21 07:39

    I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:

    // delete all comments, including multi-line ones and ones that contain '>'
    response = response.replaceAll("(?s)<!--.*?-->", "");
    // delete all script elements, matching the tag name case-insensitively
    response = response.replaceAll("(?is)<script\\b[^>]*>.*?</script\\s*>", "");
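
As a runnable illustration of comment and script removal, here is a sketch that uses the `(?s)` (dot matches newlines) and `(?i)` (case-insensitive) flags so that multi-line and upper-case elements are covered. The class name `StripCommentsAndScripts` and the sample input are assumptions, and the patterns are one possible variation, not necessarily the exact ones the answer's author has used.

```java
public class StripCommentsAndScripts {
    public static String strip(String response) {
        // Delete all comments; the lazy .*? stops at the first "-->".
        response = response.replaceAll("(?s)<!--.*?-->", "");
        // Delete all script elements, whatever the case of the tag name.
        response = response.replaceAll("(?is)<script\\b[^>]*>.*?</script\\s*>", "");
        return response;
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!-- x > y -->\n"
                + "<SCRIPT type=\"text/javascript\">\nif (a > b) {}\n</SCRIPT>"
                + "<p>b</p>";
        // Only the two paragraphs and the newline between them remain.
        System.out.println(strip(html));
    }
}
```

Other tags are left in place, so this is a filter for comments and scripts only, not a general tag stripper.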
    
  • 2020-11-21 07:40

    The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

    • It removes line breaks from the text
    • It converts text &lt;script&gt; into <script>

    If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

    // break multi-level escaping, preventing &amp;lt;script&amp;gt; from being rendered as <script>
    String replace = input.replace("&amp;", "");
    // decode any encoded HTML, preventing &lt;script&gt; from being rendered as <script>
    String html = StringEscapeUtils.unescapeHtml(replace);
    // remove all HTML tags, but keep line breaks
    String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    // decode HTML again to convert character entities back into text
    return StringEscapeUtils.unescapeHtml(clean);
    

    Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

    And here is a bunch of test cases (input to output):

    {"regular string", "regular string"},
    {"<a href=\"link\">A link</a>", "A link"},
    {"<script src=\"http://evil.url.com\"/>", ""},
    {"&lt;script&gt;", ""},
    {"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
    {"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}
    

    If you find a way to make it better, please let me know.
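
The reason for dropping "&amp;" before decoding may be easier to see in isolation. The sketch below uses a hypothetical helper, `unescapeOnce`, as a simplified stand-in for StringEscapeUtils.unescapeHtml: it decodes only `&lt;`, `&gt;` and `&amp;`, one level per call. The class name `EscapeLevelsDemo` is likewise an assumption.

```java
public class EscapeLevelsDemo {
    // Hypothetical stand-in for an HTML unescaper: handles three entities only
    // and decodes exactly one level of escaping per call.
    public static String unescapeOnce(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&"); // done last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    public static void main(String[] args) {
        String doubleEscaped = "&amp;lt;script&amp;gt;";
        // One decoding pass still yields an escaped tag...
        System.out.println(unescapeOnce(doubleEscaped));               // &lt;script&gt;
        // ...but a second pass turns it into a real one.
        System.out.println(unescapeOnce(unescapeOnce(doubleEscaped))); // <script>
        // Removing "&amp;" up front leaves nothing that can decode into a tag.
        System.out.println(unescapeOnce(doubleEscaped.replace("&amp;", ""))); // lt;scriptgt;
    }
}
```

The last line matches the "best effort" test case above: the output is not pretty, but it can no longer be decoded into a tag by a later pass.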

  • 2020-11-21 07:40

    Here is another way to do it:

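    public static String removeHTML(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean inTag = false;
    
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;        // start skipping at the opening bracket
            } else if (c == '>' && inTag) {
                inTag = false;       // resume copying after the closing bracket
            } else if (!inTag) {
                out.append(c);       // keep everything outside of tags
            }
        }
        return out.toString();
    }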
    