Strip HTML from strings in Python

难免孤独 2020-11-22 02:50
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)

When printing each line, I'd like to show only the text of each HTML element, not the markup itself. Given '<a href="whatever.example">some text</a>' it should print only 'some text', '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?

26 Answers
  • 2020-11-22 03:05

    Short version!

    import re, html

    tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

    # Remove well-formed tags, fixing mistakes by legitimate users
    no_tags = tag_re.sub('', user_input)

    # Clean up anything else by escaping (cgi.escape is deprecated and was
    # removed in Python 3.13; html.escape is the modern equivalent)
    ready_for_web = html.escape(no_tags)
    

    Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

    Why can't I just strip the tags and leave it?

    It's one thing to keep people from <i>italicizing</i> things, without leaving stray <i>s floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.
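    For example, here is a minimal sketch of the strip-then-escape pipeline (with the stdlib `html.escape` standing in for the escaping step) showing how stray brackets survive the stripper and are only neutralized by the final escape:

```python
import html
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def strip_and_escape(s):
    # Strip complete tags, then escape whatever is left over.
    return html.escape(tag_re.sub('', s))

print(strip_and_escape('blah <<<><blah'))  # -> blah &lt;blah
```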

    What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.
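    A quick sketch of that attack against the regex stripper from above (the variable names are just illustrative): each fragment passes through untouched because neither is a complete tag on its own, yet the concatenation is a live link:

```python
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

firstname = tag_re.sub('', '<a')                       # no '>', so nothing matches
lastname = tag_re.sub('', 'href="http://evil.com/">')  # no '<', so nothing matches
print(firstname + ' ' + lastname)  # -> <a href="http://evil.com/">
```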

    Django's strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

    Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

    Follow their advice!
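    The same rule applies outside Django: whatever stripper you use, escape its output before rendering. A minimal stdlib sketch (the payload is just an illustration):

```python
import html
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def strip_tags_naive(s):
    # NOT guaranteed to produce HTML-safe output on its own.
    return tag_re.sub('', s)

user_input = '<b onmouseover=alert(1)>click</b> & <a'
print(html.escape(strip_tags_naive(user_input)))  # -> click &amp; &lt;a
```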

    To strip tags with HTMLParser, you have to run it multiple times.

    It's easy to circumvent the top answer to this question.

    Look at this string (source and discussion):

    <img<!-- --> src=x onerror=alert(1);//><!-- -->
    

    The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

    <img src=x onerror=alert(1);//>
    

    This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:

    # _strip_once runs HTMLParser once, pulling out just the text of all the nodes.
    
    def strip_tags(value):
        """Returns the given HTML with all tags stripped."""
        # Note: in typical case this loop executes _strip_once once. Loop condition
        # is redundant, but helps to reduce number of executions of _strip_once.
        while '<' in value and '>' in value:
            new_value = _strip_once(value)
            if len(new_value) >= len(value):
                # _strip_once was not able to detect more tags
                break
            value = new_value
        return value
    
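    Here is a sketch of how that loop behaves, with `_strip_once` replaced by a stand-in built from the regex earlier in this answer (Django's real helper is parser-based, so its intermediate results differ):

```python
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def _strip_once(value):
    # Stand-in for Django's parser-based helper.
    return tag_re.sub('', value)

def strip_tags(value):
    """Return the given HTML with all tags stripped."""
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            break  # no more tags detected
        value = new_value
    return value

result = strip_tags('<img<!-- --> src=x onerror=alert(1);//><!-- -->')
print(result)  # the <img never survives, but a stray '>' does
```

    Note the residue: even after the loop, the output still needs escaping before it can be marked safe.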

    Of course, none of this is an issue if you always escape the result of strip_tags().

    Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

    Good things to copy or use

    My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.

    My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

    Django's strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained; you can copy what you need from django/utils/html.py.

    If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."

    Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.
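    A tiny fuzz sketch of the property that matters most (no raw angle brackets reach the page); the alphabet, string length, and iteration count are arbitrary choices:

```python
import html
import random
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

def strip_and_escape(s):
    return html.escape(tag_re.sub('', s))

random.seed(0)  # reproducible runs
alphabet = '<>!-/ab ="'
for _ in range(10_000):
    fuzz = ''.join(random.choice(alphabet) for _ in range(30))
    out = strip_and_escape(fuzz)
    # The escape step guarantees this property no matter what the stripper missed.
    assert '<' not in out and '>' not in out, fuzz
print('no raw angle brackets survived')
```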

    sheepish note - The question itself is about printing to the console, but this is the top Google result for "python strip html from string", so that's why this answer is 99% about the web.

  • 2020-11-22 03:05

    Simple code! This removes all kinds of tags, including everything inside the angle brackets.

    def rm(s):
        start = False
        end = False
        s = ' ' + s
        for i in range(len(s) - 1):
            if i < len(s):  # s shrinks as tags are cut out
                if start is not False:
                    if s[i] == '>':
                        end = i
                        s = s[:start] + s[end + 1:]
                        start = end = False
                else:
                    if s[i] == '<':
                        start = i
        # Recurse only while a removable '<...>' pair remains; the original
        # recursed on any remaining '<', which never terminates on stray brackets,
        # and it called self.rm(s) without returning the result.
        if '<' in s and '>' in s and s.index('<') < s.rindex('>'):
            return rm(s)
        return s.replace('&nbsp;', ' ')
    

    But it won't give the full result if the text itself contains < or > symbols.

  • 2020-11-22 03:07

    Here is a simple solution that strips HTML tags and decodes HTML entities based on the amazingly fast lxml library:

    from lxml import html
    
    def strip_html(s):
        return str(html.fromstring(s).text_content())
    
    strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.
    
  • 2020-11-22 03:08

    Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the HTMLParser class directly (i.e., no subclassing), making it significantly more terse:

    from html.parser import HTMLParser

    def strip_html(text):
        parts = []
        parser = HTMLParser()
        parser.handle_data = parts.append
        parser.feed(text)
        parser.close()  # flush any buffered trailing data
        return ''.join(parts)
    
  • 2020-11-22 03:08

    This method works flawlessly for me and requires no additional installations:

    import re
    from html.entities import entitydefs  # was htmlentitydefs in Python 2

    def convertentity(m):
        if m.group(1) == '#':
            try:
                return chr(int(m.group(2)))  # was unichr in Python 2
            except ValueError:
                return '&#%s;' % m.group(2)
        try:
            return entitydefs[m.group(2)]
        except KeyError:
            return '&%s;' % m.group(2)

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);', convertentity, s)

    html = converthtml(html)
    # Get rid of the remnants of certain formatting (subscript, superscript, etc.)
    html = html.replace('&nbsp;', ' ')
    
  • 2020-11-22 03:09
    # This is a regex solution.
    import re

    def removeHtml(html):
        if not html:
            return html
        # Remove comments first
        innerText = re.compile(r'<!--[\s\S]*?-->').sub('', html)
        while innerText.find('>') >= 0:  # loop through nested tags
            text = re.compile(r'<[^<>]+?>').sub('', innerText)
            if text == innerText:
                break
            innerText = text
        return innerText.strip()
    