Filter out HTML tags and resolve entities in Python

暗喜 2020-12-03 00:11

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

8 Answers
  • 2020-12-03 00:20

    Use lxml, which is the best XML/HTML library for Python.

    import lxml.html
    t = lxml.html.fromstring("...")
    t.text_content()
    

    And if you just want to sanitize the HTML, look at the lxml.html.clean module.

  • 2020-12-03 00:26

    Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.

  • 2020-12-03 00:32

    Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.

    Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.
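
    To make the point about re.sub concrete, here is a minimal sketch of the call form, together with an input that breaks it. The pattern and the helper name naive_strip are made up for illustration; they are not from Lucas' answer.

```python
import html
import re

# Naive pattern (an assumption for illustration): anything between "<" and
# the next ">". This is not a real HTML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_strip(text):
    # re.sub takes (pattern, repl, string); str itself has no .sub method.
    without_tags = re.sub(TAG_PATTERN, "", text)
    # html.unescape resolves entities such as &amp; in the remaining text.
    return html.unescape(without_tags)


print(naive_strip("<p>Fish &amp; chips</p>"))    # simple markup works
print(naive_strip('<img alt="a > b">after'))     # a ">" inside an attribute does not
```

    The second call leaves part of the attribute value behind, because the pattern stops at the first ">" it sees. That is the kind of corner case a real parser handles for you.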

  • 2020-12-03 00:33

    Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the string nodes, and join them.
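
    A minimal sketch of that approach, assuming the bs4 package (BeautifulSoup 4) is installed and using Python's built-in "html.parser" backend. The helper name soup_text is hypothetical.

```python
from bs4 import BeautifulSoup


def soup_text(markup):
    # Parse with the standard-library backend so no extra parser is required.
    soup = BeautifulSoup(markup, "html.parser")
    # .strings yields every text node in document order; entities are
    # already resolved by the time they reach the tree.
    return "".join(soup.strings)


print(soup_text("<p>Fish &amp; <b>chips</b></p>"))
```

    soup.get_text() does the same joining in one call, if you prefer that.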

  • 2020-12-03 00:35

    While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrary "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that Python includes one out of the box.
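
    For the built-in parser, a rough sketch using html.parser.HTMLParser from the Python 3 standard library might look like this. The class name TextExtractor and the function strip_html are illustrative names, not part of the library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text is passed to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>Fish &amp; <b>chips</b></p>"))
```

    Note that this keeps the contents of script and style elements as text, so you may want to skip those in handle_starttag/handle_endtag if your input contains them.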

    You should also check out the Python bindings for TidyLib, which can clean up broken HTML and make the success rate of any HTML parsing much higher.

  • 2020-12-03 00:37

    If you use Django, you might also use the striptags template filter: http://docs.djangoproject.com/en/dev/ref/templates/builtins/#striptags ;)
