What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), "Extracting Content from Accessible Web Pages", and some signs of interest here, e.g.,

8 Answers

灰色年华

2021-01-30 00:32

If you need to extract content from pages that rely heavily on JavaScript, Selenium Remote Control can do the job. It works for more than just testing. The main downside is that driving a real browser uses a lot more resources than fetching raw HTML. The upside is that you get a much more accurate data feed from rich pages and apps, because you see the page after its scripts have run.
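
A rough sketch of the idea, with some assumptions: it uses the newer WebDriver API from the `selenium` Python package rather than the original Remote Control server, and it assumes Chrome plus a matching driver are available on the machine. The helper names `fetch_rendered_html` and `make_headless_chrome` are made up for illustration.

```python
def fetch_rendered_html(driver, url):
    """Load `url` in a browser session and return the current DOM as HTML.

    `driver` is expected to behave like a Selenium WebDriver: it needs a
    `get(url)` method and a `page_source` attribute.
    """
    driver.get(url)
    # page_source reflects the DOM at this moment, i.e. after the scripts
    # that ran during page load. Content fetched later by the page may need
    # an explicit wait before it shows up.
    return driver.page_source


def make_headless_chrome():
    """Start a headless Chrome session (assumes Selenium 4 and Chrome)."""
    # Imported lazily so fetch_rendered_html can be used or tested
    # without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


# Typical use (not executed here):
#
#     driver = make_headless_chrome()
#     try:
#         html = fetch_rendered_html(driver, "https://example.com/")
#     finally:
#         driver.quit()
```

The returned HTML can then be handed to whatever extractor you would normally run on a static page. Remember to call `quit()` on the driver, since each session holds a full browser process.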

抹茶落季

2021-01-30 00:33

There are a few open source tools available that do similar article extraction tasks, for example Goose (https://github.com/jiminoc/goose), which was open-sourced by Gravity.com.

Its wiki has documentation, and the source is available to browse. Dozens of unit tests show the text extracted from various articles.
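
To give a feel for what article extractors do, here is a toy sketch using only the Python standard library. It is not Goose's algorithm. It just splits the page into text blocks, ignores tags that rarely hold article text, and keeps blocks that are reasonably long and not mostly link text. The tag sets, the thresholds, and the names `BlockCollector` and `extract_main_text` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose text is rarely article content (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Tags treated as boundaries between candidate text blocks.
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}


class BlockCollector(HTMLParser):
    """Collect (text, link_chars) pairs, one per block of visible text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0   # > 0 while inside an <a> element
        self.blocks = []
        self._pieces = []
        self._link_chars = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth:
            return
        self._pieces.append(data)
        if self.link_depth:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=40, max_link_ratio=0.5):
    """Return blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    kept = [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)
```

Real extractors use much richer signals (DOM structure, scoring of neighboring nodes, stopword counts and so on), so treat this only as an illustration of the general shape. A fixed set of saved pages with expected output, like the unit tests mentioned above, is a good way to check any such heuristic.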
