How can I extract only the main textual content from an HTML page?

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 Answers

礼貌的吻别

2021-01-31 05:33

My first instinct would be to stick with your initial approach of using Jsoup. With it, you can use selectors to retrieve only the elements you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.

Regarding your other post, was the issue of false positives your only reason for moving away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (e.g. not retrieve div elements)?
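
To illustrate the threshold idea, here is a minimal sketch of a word-count filter applied to text blocks that have already been pulled out of the page (for instance, the text of each selected p element). The class name WordCountFilter, the helper names, and the threshold value 5 are hypothetical; MIN_WORDS_SEQUENCE is assumed here to mean "minimum number of words a block must contain", which may differ from how your code defines it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordCountFilter {
    // Assumed meaning: minimum words a block needs to count as content.
    // The value 5 is arbitrary and only for illustration.
    static final int MIN_WORDS_SEQUENCE = 5;

    /** Counts whitespace-separated words in a block of text. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only the blocks that contain at least minWords words. */
    static List<String> keepLongBlocks(List<String> blocks, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (countWords(block) >= minWords) {
                kept.add(block.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins for the text of elements returned by a selector.
        List<String> blocks = Arrays.asList(
            "Home",
            "About us",
            "The quick brown fox jumps over the lazy dog near the river bank.",
            "Read more");
        // Short navigation-like blocks are dropped; the sentence is kept.
        System.out.println(keepLongBlocks(blocks, MIN_WORDS_SEQUENCE));
    }
}
```

Raising the threshold cuts more false positives (menus, link labels) at the cost of dropping short but legitimate paragraphs, so it is worth trying a few values on representative pages.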

感动是毒

2021-01-31 05:39
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: removing the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:
```
// url is a java.net.URL; getText throws BoilerpipeProcessingException on failure
String text = ArticleExtractor.INSTANCE.getText(url);
```
You can use a String:
```
// myHtml is a String holding the page's HTML markup
String text = ArticleExtractor.INSTANCE.getText(myHtml);
```
You can also pass a Reader, which lets you feed in HTML from many other sources (files, streams, and so on).
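
As a rough sketch of the Reader route, the snippet below builds a Reader over raw bytes with an explicit charset, which is the typical reason to prefer a Reader over a String or URL. It uses only the standard library; the class HtmlReaders and its methods are hypothetical helpers, and the Boilerpipe call appears only as a comment because it needs the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HtmlReaders {
    /** Wraps a byte stream in a Reader that decodes with the given charset. */
    static Reader fromStream(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    /** Drains a Reader into a String, to show what an extractor would receive. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pretend these bytes came from a file or socket in a legacy encoding.
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        Reader reader = fromStream(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);

        // With Boilerpipe on the classpath, the Reader would be handed over like this:
        // String text = ArticleExtractor.INSTANCE.getText(reader);

        System.out.println(readAll(reader));
    }
}
```

Decoding the bytes yourself avoids relying on charset auto-detection when the page's encoding is already known.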
無奈伤痛

2021-01-31 05:43

You can use a library such as Goose, which works best on articles and news pages. You can also look at the JavaScript code of the Readability bookmarklet, which performs a similar extraction.
