What is the state of the art in HTML content extraction?


無奈伤痛 2021-01-29 23:52

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g.,

8 Answers

故里飘歌 (original poster)

2021-01-30 00:22

Beautiful Soup is a robust HTML parser written in Python.

It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access a `<bar>` element nested inside `<foo>` using `doc.foo.bar`), and seamless Unicode.
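To make the dot-notation and bad-markup points concrete, here is a minimal sketch. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses the standard-library `html.parser` backend; the sample markup and variable names are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy sample markup (hypothetical): <p> and <b> are never closed.
html = "<html><body><foo><bar>hello</bar></foo><p>Some <b>bold text</body></html>"

# "html.parser" is Python's built-in backend, so no extra parser is required.
doc = BeautifulSoup(html, "html.parser")

# Dot-notation child access: the first <foo>, then its first <bar>.
bar_text = doc.foo.bar.get_text()

# Searching still works even though <b> was left unclosed.
bold_text = doc.find("b").get_text()

# A crude form of content extraction: every text node, whitespace-trimmed.
all_text = doc.get_text(" ", strip=True)

print(bar_text)
print(bold_text)
print(all_text)
```

Note that this only parses and flattens the page. Separating the main content from navigation and ads, which is what the question is really about, would need additional heuristics on top of the parsed tree.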
