Apache Tika: how to extract the HTML body without header and footer content

失恋的感觉 2021-01-19 04:59

I want to extract the entire body content of an HTML page, excluding the header and footer, but I am getting this exception:

org.xml.sax.SAXException: Namespace http

1 answer
  • 2021-01-19 05:24

    Found a solution based on research into boilerplate detection: boilerpipe, which is integrated with Apache Tika and can be run with the Java code below.

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.html.BoilerpipeContentHandler;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public class NewtikaXpath {
        public static void main(String[] args) throws IOException, SAXException, TikaException {
            AutoDetectParser parser = new AutoDetectParser();
            // Collects the body as plain text; the no-arg constructor caps output at 100,000 characters.
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            // Replace the placeholder with the URL of the page to extract.
            try (InputStream stream = TikaInputStream.get(new URL("your favourite url"))) {
                // The boilerpipe wrapper filters out boilerplate (navigation, header, footer)
                // and forwards only the main content to textHandler.
                parser.parse(stream, new BoilerpipeContentHandler(textHandler), metadata);
                System.out.println("text:\n" + textHandler.toString());
            }
        }
    }
    

    A simple demo of boilerpipe detection is available at.. and more information is available at..
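
To give a rough feel for what boilerplate detection does, here is a toy sketch of one shallow text feature, link density (the share of a block's words that sit inside links). It is not boilerpipe's actual classifier and does not use Tika. The `LinkDensitySketch` class, the `Block` type, and both thresholds are hypothetical names and values made up for this illustration; it also assumes the page has already been split into text blocks with their linked-word counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration of link-density filtering; NOT boilerpipe's real algorithm. */
final class LinkDensitySketch {

    /** A block of page text plus the number of its words that appear inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated words in the block. */
        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of words that are link text, between 0.0 and 1.0. */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds, chosen for this example only.
    static final int MIN_WORDS = 5;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Keeps blocks that are long enough and are not mostly link text. */
    static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
            new Block("Home Products About Contact", 4),          // navigation bar
            new Block("Apache Tika detects and extracts text from many document formats.", 0),
            new Block("Privacy policy Terms of use Sitemap", 6)); // footer links
        for (String text : keepContent(page)) {
            System.out.println(text);
        }
    }
}
```

With the sample page in `main`, the navigation bar is dropped for being too short and the footer for consisting entirely of link text, so only the middle sentence is printed. Header and footer regions tend to be short and link-heavy, which is why filtering of this kind can remove them without any knowledge of the page's markup.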
