What is the state of the art in HTML content extraction?

無奈伤痛 2021-01-29 23:52

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), Extracting Content from Accessible Web Pages, and some signs of interest here, e.g.,

8 answers
  •  故里飘歌
    2021-01-30 00:22

    Beautiful Soup is a robust HTML parser written in Python.

    It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access a `<bar>` element nested inside `<foo>` using `doc.foo.bar`) and seamless unicode.
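
To make those features concrete, here is a minimal sketch. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses the standard-library `html.parser` backend. The sample markup and the names `BAD_HTML`, `parse`, and `doc` are made up for illustration and do not come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: deliberately sloppy markup with unclosed
# <p> and <b> tags and no <html>/<body> wrapper.
BAD_HTML = "<foo><bar>first</bar><bar>second</bar></foo><p>Some <b>bold text"


def parse(html):
    """Parse possibly malformed HTML using Python's built-in parser backend."""
    return BeautifulSoup(html, "html.parser")


doc = parse(BAD_HTML)

# Dot-notation child access: the first <bar> inside the first <foo>.
first_bar = doc.foo.bar.get_text()

# Search: every <bar> element in the document.
all_bars = [tag.get_text() for tag in doc.find_all("bar")]

# The unclosed <p> and <b> are closed implicitly at the end of the input.
bold = doc.p.b.get_text()

# Generator-based iteration over all text nodes, whitespace stripped.
texts = list(doc.stripped_strings)

print(first_bar)
print(all_bars)
print(bold)
print(texts)
```

Dot-notation only returns the first matching child, so `find_all` is the safer choice when a tag may repeat. For real content extraction you would probably still need some heuristic on top of the parse tree to separate the main text from navigation and ads.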
