Python method to extract content (excluding navigation) from an HTML page

无人及你 2021-01-31 23:13

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content

5 answers
  •  臣服心动
    2021-02-01 00:08

    Goose is just the library for this task. To quote their README:

    Goose will try to extract the following information:

    • Main text of an article
    • Main image of article
    • Any Youtube/Vimeo movies embedded in article
    • Meta Description
    • Meta tags
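
To make the idea concrete, here is a rough standard-library sketch of the same kind of extraction. It is not Goose's algorithm and does not use Goose's API. It only assumes that navigation and other boilerplate live inside tags such as `nav`, `header`, `footer` and `aside`, which real pages often violate. The names `SKIPPED_TAGS`, `ContentExtractor` and `extract_content` are made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-content (an assumption of this sketch).
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collect text outside SKIPPED_TAGS, plus <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # greater than 0 while inside a skipped element
        self.text_parts = []   # raw text chunks found outside skipped elements
        self.meta = {}         # lowercased meta name -> content

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            content = attr_map.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.text_parts.append(data)


def extract_content(html):
    """Return the page text (whitespace-normalized) and its named meta tags."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    # Join the chunks, then collapse every run of whitespace into one space.
    text = " ".join("".join(parser.text_parts).split())
    return {"text": text, "meta": parser.meta}


sample = """
<html><head><meta name="description" content="A short demo page"></head>
<body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>First paragraph.</p>
<p>Second <b>paragraph</b>.</p>
<footer>Copyright</footer>
</body></html>
"""

result = extract_content(sample)
print(result["text"])
print(result["meta"])
```

On the sample page this keeps the two paragraphs and the meta description and drops the navigation links and footer. A dedicated library such as Goose uses heuristics to find the main article, so it should cope much better with pages that do not mark up their navigation this cleanly.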
