How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 Answers
  •  醉话见心
    2021-01-31 05:23

    http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

    It is proprietary software, but it makes extracting data from web pages very easy and integrates well with Java.

    You use a provided application to design XML files, which the RoboServer API reads in order to parse web pages. You build each XML file by analyzing the pages you want to parse inside the provided application (fairly easy) and applying rules for gathering the data (pages on the same website generally follow the same patterns). You can set up scheduling, running, and database integration using the provided Java API.

    If you'd rather not use this software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site
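
As a rough illustration of the per-site idea, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example rather than part of the answer: the `SITE_RULES` table, the hostnames, the class name `article-body`, and the helper `extract_main_text` are all hypothetical. Each rule names the tag (and optionally the CSS class) of the container that holds the main text for one site, and the parser collects only the text found inside that container.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> (tag, CSS class or None)
# identifying the container that holds the main text on that site.
SITE_RULES = {
    "news.example.com": ("div", "article-body"),
    "blog.example.org": ("article", None),
}

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collects text found inside the first matching container element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0   # 0 means we are outside the target container
        self.skip = 0    # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
            if tag in ("script", "style"):
                self.skip += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class is None or self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in ("script", "style") and self.skip > 0:
            self.skip -= 1
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and self.skip == 0 and text:
            self.chunks.append(text)


def extract_main_text(url, html):
    """Return the main text for a known site, or None if no rule exists."""
    rule = SITE_RULES.get(urlparse(url).hostname)
    if rule is None:
        return None
    parser = ContainerTextExtractor(*rule)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `extract_main_text(url, html)` returns `None` for a site without a rule, which makes it easy to spot the sites that still need one. The depth counting assumes reasonably well-formed markup; pages with unclosed tags would likely need a more forgiving parser.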
