Remove HTML tags from a String

误落风尘 2020-11-21 07:35

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "…

30 answers
  •  被撕碎了的回忆
    2020-11-21 07:53

    The accepted answer did not work for the test case I indicated: for the input "a < b or b > c" it returns "a b or b > c".
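
    The same test case also trips up the lazy regex from the question, which is part of why a parser is the safer choice. Below is a minimal sketch using only the JDK. The class and method names (`RegexStripDemo`, `stripTags`) are made up for illustration, and the pattern is assumed from the visible part of the question's snippet, written without the backslash escape.

```java
public class RegexStripDemo {
    // Hypothetical helper: removes anything that looks like <...> using a lazy regex.
    static String stripTags(String input) {
        return input.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Real markup is stripped as intended.
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"

        // Plain text containing comparison operators loses everything
        // between the first '<' and the next '>'.
        System.out.println(stripTags("a < b or b > c")); // prints "a  c"
    }
}
```

    The regex cannot tell a tag from a pair of comparison operators, so the text between `<` and `>` disappears along with them.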

    So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

    import java.io.IOException;
    import java.io.StringReader;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;

    /**
     * Take HTML and give back the text part while dropping the HTML tags.
     *
     * There is some risk that using TagSoup means we'll permute non-HTML text.
     * However, it seems to work the best so far in test cases.
     *
     * @author dan
     * @see org.ccil.cowan.tagsoup.Parser
     */
    public class Html2Text2 implements ContentHandler {
        // Collects the text content; reset on every call to parse().
        private StringBuilder sb = new StringBuilder();

        public Html2Text2() {
        }

        public void parse(String str) throws IOException, SAXException {
            XMLReader reader = new Parser();
            reader.setContentHandler(this);
            sb = new StringBuilder();
            reader.parse(new InputSource(new StringReader(str)));
        }

        public String getText() {
            return sb.toString();
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            // Append only the slice of the buffer that SAX reports as text.
            sb.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            // Use the same slice here; appending the whole array would copy
            // unrelated buffer content into the result.
            sb.append(ch, start, length);
        }

        // The methods below do not contribute to the text
        @Override
        public void endDocument() throws SAXException {
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
        }

        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
        }

        @Override
        public void processingInstruction(String target, String data)
                throws SAXException {
        }

        @Override
        public void setDocumentLocator(Locator locator) {
        }

        @Override
        public void skippedEntity(String name) throws SAXException {
        }

        @Override
        public void startDocument() throws SAXException {
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
        }

        @Override
        public void startPrefixMapping(String prefix, String uri)
                throws SAXException {
        }
    }
    
