How can I extract only the main textual content from an HTML page?

前端未结

关注

 9  1419

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don\'t need only the main content because many pages don\'t have an article, but only links with s

9条回答

轻奢々 (楼主)

2021-01-31 05:22

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

Reader reader = ...
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them) 
for (TextBlock block : getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not 
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods, feel free to play around!

0 讨论(0)

查看其它9个回答