Filter out HTML tags and resolve entities in Python

暗喜 2020-12-03 00:11

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

8 answers
  •  有刺的猬
    2020-12-03 00:42

    You might need something more complicated than a regular expression. Web pages often contain angle brackets that aren't part of a tag, like this:

    <div>5 < 7</div>

    Stripping the tags with a regex will return the string "5 ", because the regex treats

    < 7</div>

    as a single tag and strips it out.

    I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html (it can also resolve HTML entities).
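
    As a rough illustration of the pitfall, and of one possible alternative that uses only the Python 3 standard library (html.parser, not the scrape module linked above), here is a minimal sketch. The names TextExtractor, strip_tags and naive_strip are made up for this example, and the regex is just an assumed "typical" tag-stripping pattern.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops tags."""

    def __init__(self):
        # convert_charrefs=True (the default in Python 3.5+) makes the
        # parser resolve entities such as &amp; before calling handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with entities resolved."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def naive_strip(markup):
    """Assumed typical regex approach, shown only for comparison."""
    return re.sub(r"<[^>]*>", "", markup)


print(repr(naive_strip("<div>5 < 7</div>")))        # '5 '
print(repr(strip_tags("<div>5 < 7</div>")))         # '5 < 7'
print(repr(strip_tags("<p>Fish &amp; chips</p>")))  # 'Fish & chips'
```

    The parser only treats "<" as the start of a tag when a letter, "/", "!" or "?" follows it, so the stray bracket survives as text. Note that this sketch also keeps the contents of script and style elements, which you would probably want to skip on real pages.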
