How can I extract only the main textual content from an HTML page?

前端 未结 9 1413
旧巷少年郎
旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don\'t need only the main content because many pages don\'t have an article, but only links with s

9条回答
  •  礼貌的吻别
    2021-01-31 05:33

    My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p"); and not have to worry about the other elements with random content.

    On the matter of your other post, was the issue of false positives your only reasoning for straying away from Jsoup? If so, couldn't you just tweak the number of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. do not retrieve div elements)

提交回复
热议问题