Strip HTML from strings in Python

难免孤独 2020-11-22 02:50
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.example">some text</a>', I only want to print 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?

26 Answers
  • 2020-11-22 03:16

    The Beautiful Soup package does this immediately for you.

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'html.parser')  # html is the markup as one string; naming a parser avoids bs4's warning
    text = soup.get_text()
    print(text)
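
    If adjacent elements run together in the output, get_text() also accepts a separator and a strip flag; a small sketch with hypothetical markup:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup('<p>Hello</p><p>world</p>', 'html.parser')
    print(soup.get_text(separator=' ', strip=True))
    # -> Hello world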
    
  • 2020-11-22 03:16

    You can either use a different HTML parser (such as lxml or Beautiful Soup), one that offers functions to extract just the text, or run a regex on your string that strips out the tags. See the Python docs for more.
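
    For example, a rough regex sketch (a shortcut, not a full solution: a regex copes with simple markup only, not with comments, scripts, or malformed nesting; the strip_tags helper name is just illustrative):

    import re
    
    def strip_tags(line):
        # drop anything that looks like a tag: '<' ... '>'
        return re.sub(r'<[^>]+>', '', line)
    
    print(strip_tags('<a href="page.html">some text</a>'))
    # -> some text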

  • 2020-11-22 03:16

    I have used Eloff's answer successfully for Python 3.1 [many thanks!].

    I upgraded to Python 3.2.3, and ran into errors.

    The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code:

    def __init__(self):
        self.reset()
        self.fed = []
    

    ... in order to make it look like this:

    def __init__(self):
        super().__init__()
        self.reset()
        self.fed = []
    

    ... and it will work for Python 3.2.3.

    Again, thanks to Thomas K for the fix and for Eloff's original code provided above!
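
    Putting the pieces together, a complete Python 3 version of the stripper (a sketch assembled from the fragments quoted above; the strip_tags wrapper name is illustrative) looks roughly like this:

    from html.parser import HTMLParser
    
    class MLStripper(HTMLParser):
        def __init__(self):
            super().__init__()   # required on Python 3.2+, as described above
            self.reset()
            self.fed = []
        def handle_data(self, d):
            self.fed.append(d)
        def get_data(self):
            return ''.join(self.fed)
    
    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()
    
    print(strip_tags('<b>hello</b> world'))
    # -> hello world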

  • 2020-11-22 03:19

    If you need to preserve HTML entities (e.g. &amp;), I added a handle_entityref method to Eloff's answer.

    from HTMLParser import HTMLParser  # Python 2; the module is html.parser in Python 3
    
    class MLStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.fed = []
        def handle_data(self, d):
            # plain text between tags
            self.fed.append(d)
        def handle_entityref(self, name):
            # keep named entities such as &amp; verbatim instead of dropping them
            self.fed.append('&%s;' % name)
        def get_data(self):
            return ''.join(self.fed)
    
    def html_to_text(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()
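
    A quick check (Python 2, matching the import above; the snippet is hypothetical) shows the named entity surviving untouched:

    print(html_to_text('<p>Fish &amp; chips</p>'))
    # -> Fish &amp; chips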
    
  • 2020-11-22 03:19

    A Python 3 adaptation of søren-løvborg's answer:

    from html.parser import HTMLParser
    from html.entities import html5
    
    class HTMLTextExtractor(HTMLParser):
        """ Adaption of http://stackoverflow.com/a/7778368/196732 """
        def __init__(self):
            # convert_charrefs=False so the two reference handlers below are
            # actually invoked (Python 3.5+ otherwise decodes them automatically)
            super().__init__(convert_charrefs=False)
            self.result = []
    
        def handle_data(self, d):
            self.result.append(d)
    
        def handle_charref(self, number):
            # numeric references such as &#38; or &#x26;
            codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
            self.result.append(chr(codepoint))
    
        def handle_entityref(self, name):
            # named references; the html5 table keys end in ';' and map
            # directly to the replacement text
            if name + ';' in html5:
                self.result.append(html5[name + ';'])
    
        def get_text(self):
            return u''.join(self.result)
    
    def html_to_text(html):
        s = HTMLTextExtractor()
        s.feed(html)
        return s.get_text()
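
    A quick check with a hypothetical snippet (both named and numeric references are decoded):

    print(html_to_text('Fish &amp; chips &#8211; <b>today</b>'))
    # -> Fish & chips – today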
    
  • 2020-11-22 03:20

    An lxml.html-based solution (lxml is a native library and can be more performant than a pure-Python solution).

    Remove ALL tags

    from lxml import html
    
    
    ## from file-like object or URL
    tree = html.parse(file_like_object_or_url)
    
    ## from string
    tree = html.fromstring('safe <script>unsafe</script> safe')
    
    print(tree.text_content().strip())
    
    ### OUTPUT: 'safe unsafe safe'
    
    

    Remove ALL tags with pre-sanitizing HTML (dropping some tags)

    from lxml import html
    from lxml.html.clean import clean_html
    
    tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
                                Detailed answers to any questions you might have
                            </span>""")
    
    ## text only
    print(clean_html(tree).text_content().strip())
    
    ### OUTPUT: 'Detailed answers to any questions you might have'
    

    Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly the lxml Cleaner does.

    If you need more control over what exactly is sanitized before converting to text, you might want to use the lxml Cleaner explicitly and pass the options you want to its constructor, e.g.:

    from lxml.html.clean import Cleaner
    
    cleaner = Cleaner(page_structure=True,
                      meta=True,
                      embedded=True,
                      links=True,
                      style=True,
                      processing_instructions=True,
                      inline_style=True,
                      scripts=True,
                      javascript=True,
                      comments=True,
                      frames=True,
                      forms=True,
                      annoying_tags=True,
                      remove_unknown_tags=True,
                      safe_attrs_only=True,
                      safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                      remove_tags=('span', 'font', 'div')
                      )
    sanitized_html = cleaner.clean_html(unsafe_html)
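
    Once sanitized, the markup can be flattened to text just as before. A self-contained sketch (unsafe_html here is only a hypothetical example string):

    from lxml import html
    from lxml.html.clean import Cleaner
    
    cleaner = Cleaner(scripts=True, javascript=True, style=True, comments=True)
    unsafe_html = '<div><script>alert(1)</script><p>Hello <b>world</b></p></div>'
    
    ## Cleaner.clean_html returns the same type it was given (here: a string)
    sanitized_html = cleaner.clean_html(unsafe_html)
    
    print(html.fromstring(sanitized_html).text_content().strip())
    
    ### OUTPUT: 'Hello world'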
    

    If you need more control over how the plain text is generated, you can use lxml.etree.tostring instead of text_content():

    from lxml.etree import tostring
    
    plain_bytes = tostring(tree, method='text', encoding='utf-8')
    print(plain_bytes.decode('utf-8'))
    
    