Remove HTML tags from a String

误落风尘 2020-11-21 07:35

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "…

30 Answers
  •  青春惊慌失措
    2020-11-21 07:46

    Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        // Accumulates the text content reported by the parser
        private StringBuilder s;

        public void parse(Reader in) throws IOException {
            s = new StringBuilder();
            ParserDelegator delegator = new ParserDelegator();
            // the third parameter is true to ignore the charset directive
            delegator.parse(in, this, true);
        }

        @Override
        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            // the HTML to convert; try-with-resources closes the reader even on failure
            try (Reader in = new FileReader("java-new.html")) {
                Html2Text parser = new Html2Text();
                parser.parse(in);
                System.out.println(parser.getText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    

    Ref: Remove HTML tags from a file to extract only the TEXT
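
Since the question is about a string already in memory rather than a file, the same parser callback can be fed from a StringReader. The following is a minimal sketch under that assumption. The class name HtmlStrings and the method stripTags are hypothetical names chosen for illustration and do not come from the answer above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (illustration only): applies the ParserCallback
// technique to an in-memory String instead of a file.
class HtmlStrings {

    /** Returns the text pieces the Swing HTML parser reports for the given markup. */
    static String stripTags(String html) {
        final StringBuilder text = new StringBuilder();

        // Only text callbacks are recorded; tag callbacks are left as no-ops.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };

        try (Reader in = new StringReader(html)) {
            // third argument true: ignore any charset directive in the markup
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            // Not expected for a StringReader; rethrown unchecked to keep the signature simple
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Because only handleText is recorded, text from adjacent block elements may run together without a separator. If that matters, one option is to also override handleStartTag or handleEndTag and append whitespace there.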
