Remove HTML tags from a String

误落风尘 2020-11-21 07:35

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "


        
30 answers
  • 2020-11-21 07:41

    I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

    noHTMLString.replaceAll("\\&.*?\\;", "");
    

    instead of this:

    html = html.replaceAll("&nbsp;","");
    html = html.replaceAll("&amp;", "");
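
One caveat with the broader pattern, shown here as a minimal sketch: it removes everything from any `&` up to the next `;`, so a bare ampersand in ordinary text can swallow real words. The class name `EntityStripDemo`, the sample string, and the stricter pattern in `stripEntityLike` are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

class EntityStripDemo {
    // Pattern from the answer above: lazily matches from '&' to the next ';'.
    private static final Pattern ANY = Pattern.compile("\\&.*?\\;");

    // Hypothetical stricter pattern: only named (&nbsp;), decimal (&#169;)
    // and hexadecimal (&#xA9;) character references.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    static String stripAny(String s) {
        return ANY.matcher(s).replaceAll("");
    }

    static String stripEntityLike(String s) {
        return ENTITY.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "Tom & Jerry; a&nbsp;b";
        System.out.println(stripAny(text));        // prints: Tom  ab
        System.out.println(stripEntityLike(text)); // prints: Tom & Jerry; ab
    }
}
```

Both variants delete the reference rather than decode it, so `&amp;` vanishes instead of becoming `&`. If the decoded character matters, an HTML parser is the better tool.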
    
  • 2020-11-21 07:44

    This is also very simple with Jericho, and you can retain some of the formatting (line breaks and links, for example).

        Source htmlSource = new Source(htmlText);
        Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
        Renderer htmlRend = new Renderer(htmlSeg);
        System.out.println(htmlRend.toString());
    
  • 2020-11-21 07:44

    Another option is the com.google.gdata.util.common.html.HtmlToText class, used like this:

    MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));
    

    This is not bulletproof, though: when I run it on Wikipedia entries, style information ends up in the output as well. For small, simple jobs I believe it is still effective.

  • 2020-11-21 07:45

    One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desirable in certain situations:

    InputStream htmlInputStream = ..
    HtmlParser htmlParser = new HtmlParser();
    HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
    htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
    System.out.println(htmlContentHandler.getBodyText().trim());
    
  • 2020-11-21 07:46

    Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    
    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;
    
        public Html2Text() {
        }
    
        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // the third parameter is TRUE to ignore charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }
    
        public void handleText(char[] text, int pos) {
            s.append(text);
        }
    
        public String getText() {
            return s.toString();
        }
    
        public static void main(String[] args) {
            try {
                // the HTML to convert
                FileReader in = new FileReader("java-new.html");
                Html2Text parser = new Html2Text();
                parser.parse(in);
                in.close();
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    

    Reference: Remove HTML tags from a file to extract only the TEXT
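
Since the question is about a String rather than a file, the same callback can be fed from a `StringReader`. This is a minimal sketch, assuming a hypothetical helper class `StringHtml2Text`. It uses only the JDK's Swing parser, with `StringBuilder` in place of `StringBuffer`. How the parser treats whitespace between elements is not something this sketch controls, so check the output on real input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class StringHtml2Text {
    // Collects every text node the Swing HTML parser reports.
    static String toText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data);
            }
        };
        try {
            // The third argument is true to ignore any charset directive.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Reading from a String should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello <b>world</b></p>"));
    }
}
```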

  • 2020-11-21 07:47

    Use an HTML parser instead of regex. This is dead simple with Jsoup.

    public static String html2text(String html) {
        return Jsoup.parse(html).text();
    }
    

    Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
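
To illustrate why a parser is the safer choice, here is a small sketch of how a lazy tag regex can misbehave. It uses only the JDK. The class name `RegexStripDemo` and the sample inputs are made up for illustration.

```java
class RegexStripDemo {
    // The naive approach: delete anything between '<' and the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Simple markup works as hoped.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));   // prints: Hello world
        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<a title=\"1 > 0\">link</a>")); // prints:  0">link
        // Plain text that merely contains '<' and '>' gets eaten.
        System.out.println(stripTags("if a < b and c > d"));          // prints: if a  d
    }
}
```

The regex has no notion of attributes, comments, or text content, which is exactly the context a parser keeps track of.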

    See also:

    • RegEx match open tags except XHTML self-contained tags
    • What are the pros and cons of the leading Java HTML parsers?
    • XSS prevention in JSP/Servlet web application