What is the state of the art in HTML content extraction?

無奈伤痛 2021-01-29 23:52

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), "Extracting Content from Accessible Web Pages", and some signs of interest here, e.g.,

8 answers
  •  抹茶落季
    2021-01-30 00:33

    There are a few open source tools available that do similar article extraction tasks. One is Goose (https://github.com/jiminoc/goose), which was open-sourced by Gravity.com.

    The project's wiki has more information, and the source code is available to view. There are dozens of unit tests that show the text extracted from various articles.
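
To give a rough feel for what an article extraction task involves, here is a minimal sketch in Python using only the standard library. It is not Goose's algorithm or API. It is a simple heuristic that keeps `<p>` paragraphs with enough words and a low share of link text, and skips page chrome such as `<nav>` and `<footer>`. The names `ParagraphCollector` and `extract_article`, and the thresholds `min_words` and `max_link_density`, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome and ignored (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> and how much of it sits inside links."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a tag from SKIP_TAGS
        self.in_p = False
        self.in_a = False
        self.text = []         # text chunks of the current paragraph
        self.link_chars = 0    # characters of the current paragraph inside <a>
        self.paragraphs = []   # (text, link_density) for each finished paragraph

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self.in_p = True
        elif tag == "a":
            self.in_a = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.text.append(data)
            if self.in_a:
                self.link_chars += len(data)

    def _flush(self):
        # Normalize whitespace, then record the paragraph with its link density.
        joined = " ".join("".join(self.text).split())
        if joined:
            density = min(1.0, self.link_chars / len(joined))
            self.paragraphs.append((joined, density))
        self.text, self.link_chars, self.in_p = [], 0, False


def extract_article(html, min_words=8, max_link_density=0.3):
    """Return paragraphs that look like article text, joined by blank lines."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()            # pick up a trailing paragraph that was never closed
    kept = [
        text
        for text, density in parser.paragraphs
        if len(text.split()) >= min_words and density <= max_link_density
    ]
    return "\n\n".join(kept)
```

Calling `extract_article(page_html)` on a fetched page returns the kept paragraphs as plain text. Real extractors such as Goose use considerably more signals than this, so treat the sketch as a baseline for experimenting rather than a replacement.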
