How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 answers
  •  予麋鹿 (original poster)
    2021-01-31 05:19

    There appears to be a possible problem with Boilerpipe. Why? Well, it appears to be suited only to certain kinds of web pages, namely those that have a single body of content.

    So one can crudely classify web pages into three kinds with respect to Boilerpipe:

    1. a web page with a single article in it (Boilerpipe worthy!)
    2. a web page with multiple articles in it, such as the front page of the New York Times
    3. a web page that doesn't really have any article in it, but has some content in the form of links, and may also have some degree of clutter.

    Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page could first be classified into one of these three buckets, then Boilerpipe could be applied to case #1 only. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.
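
The three-way split described above could be approximated with a few structural signals before deciding whether to run an article extractor at all. What follows is only a rough sketch using Python's standard `html.parser`. It is not how Boilerpipe works internally, and the function name `classify_page`, the label strings, and every threshold are assumptions chosen for illustration that would need tuning on real pages.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3"}
LONG_PARAGRAPH_WORDS = 20   # assumed cutoff for a "real" paragraph
MIN_LONG_PARAGRAPHS = 3     # assumed minimum for an article body
MAX_LINK_DENSITY = 0.3      # assumed ceiling on the share of words inside links
MIN_LINKED_HEADINGS = 3     # assumed minimum for a multi-article front page


class PageStats(HTMLParser):
    """Gathers word counts, link words, long paragraphs and links inside headings."""

    def __init__(self):
        super().__init__()
        self.total_words = 0
        self.link_words = 0
        self.long_paragraphs = 0
        self.linked_headings = 0
        self._skip = 0        # depth inside <script>/<style>
        self._in_a = 0        # depth inside <a>
        self._in_heading = 0  # depth inside <h1>-<h3>
        self._in_p = False
        self._p_words = 0     # non-link words in the current <p>

    def end_paragraph(self):
        # Close the current paragraph (if any) and record whether it was long.
        if self._in_p and self._p_words >= LONG_PARAGRAPH_WORDS:
            self.long_paragraphs += 1
        self._in_p = False
        self._p_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self.end_paragraph()  # an unclosed <p> ends at the next <p>
            self._in_p = True
        elif tag == "a":
            self._in_a += 1
            if self._in_heading:
                self.linked_headings += 1
        elif tag in HEADINGS:
            self._in_heading += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "p":
            self.end_paragraph()
        elif tag == "a":
            self._in_a = max(0, self._in_a - 1)
        elif tag in HEADINGS:
            self._in_heading = max(0, self._in_heading - 1)

    def handle_data(self, data):
        if self._skip:
            return
        n = len(data.split())
        self.total_words += n
        if self._in_a:
            self.link_words += n
        elif self._in_p:
            self._p_words += n


def classify_page(html: str) -> str:
    """Return 'single_article', 'multi_article' or 'link_page' (heuristic guess)."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    stats.end_paragraph()  # flush a trailing unclosed <p>
    density = stats.link_words / stats.total_words if stats.total_words else 1.0
    if stats.long_paragraphs >= MIN_LONG_PARAGRAPHS and density < MAX_LINK_DENSITY:
        return "single_article"   # case #1: worth handing to an article extractor
    if stats.linked_headings >= MIN_LINKED_HEADINGS:
        return "multi_article"    # case #2: many linked headlines, front-page style
    return "link_page"            # case #3: mostly links and clutter
```

A pipeline built on this would call the article extractor only when `classify_page` returns `"single_article"` and route the other two labels elsewhere. Single-page signals like these cannot tell clutter from content in case #3; that still needs a comparison across related pages, as noted above.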
