How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 answers
  •  轻奢々 (OP)
    2021-01-31 05:22

    You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

    Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

    import java.io.Reader;
    import org.xml.sax.InputSource;
    import de.l3s.boilerpipe.document.TextBlock;
    import de.l3s.boilerpipe.document.TextDocument;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;
    import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
    
    Reader reader = ...
    InputSource is = new InputSource(reader);
    
    // parse the document into boilerpipe's internal data structure
    TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
    
    // perform the extraction/classification process on "doc"
    ArticleExtractor.INSTANCE.process(doc);
    
    // iterate over all blocks (= segments as "ArticleExtractor" sees them)
    for (TextBlock block : doc.getTextBlocks()) {
        // block.isContent() tells you if it's likely to be content or not
        // block.getText() gives you the block's text
    }
    

    TextBlock has some more exciting methods, feel free to play around!
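
If you want to keep both kinds of segment rather than discard the non-content ones, one option is to sort the classified blocks into two lists. The following is only a rough sketch of that idea: `Block` is a hypothetical stand-in for boilerpipe's `TextBlock` (mirroring just `isContent()` and `getText()`) so that it compiles without the library, and `BlockPartitionSketch` and `textsOf` are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for boilerpipe's TextBlock so the sketch compiles
// without the library; only isContent() and getText() are mirrored here.
final class Block {
    private final String text;
    private final boolean content;

    Block(String text, boolean content) {
        this.text = text;
        this.content = content;
    }

    boolean isContent() { return content; }
    String getText() { return text; }
}

final class BlockPartitionSketch {
    // Collects the text of every block whose isContent() flag equals wantContent,
    // preserving the original block order.
    static List<String> textsOf(List<Block> blocks, boolean wantContent) {
        List<String> out = new ArrayList<>();
        for (Block block : blocks) {
            if (block.isContent() == wantContent) {
                out.add(block.getText());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample blocks, as if a page had already been classified.
        List<Block> blocks = List.of(
            new Block("Home | About | Contact", false),
            new Block("The article body goes here.", true),
            new Block("Related links: A, B, C", false));

        System.out.println("content:     " + textsOf(blocks, true));
        System.out.println("non-content: " + textsOf(blocks, false));
    }
}
```

With the real library you would presumably drop the `Block` class, change the parameter type to `List<TextBlock>`, and pass in `doc.getTextBlocks()` after `ArticleExtractor.INSTANCE.process(doc)` has run.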
