Strip HTML from strings in Python

难免孤独 2020-11-22 02:50
from mechanize import Browser

br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)

When p

26 answers
  •  盖世英雄少女心
    2020-11-22 02:57

    For one project, I needed to strip HTML, but also CSS and JS. Thus, I made a variation of Eloff's answer:

    from html.parser import HTMLParser

    class MLStripper(HTMLParser):
        def __init__(self):
            # convert_charrefs=True delivers entities such as &amp; as decoded text
            super().__init__(convert_charrefs=True)
            self.fed = []
            self.css = False  # True while inside <style> or <script>

        def handle_starttag(self, tag, attrs):
            if tag in ("style", "script"):
                self.css = True

        def handle_endtag(self, tag):
            if tag in ("style", "script"):
                self.css = False

        def handle_data(self, d):
            # Keep text only when it is not CSS or JavaScript source
            if not self.css:
                self.fed.append(d)

        def get_data(self):
            return ''.join(self.fed)

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        s.close()  # flush any text still buffered by the parser
        return s.get_data()
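
The same idea can be generalized a little. The following is only a sketch, not part of the answer above: the names `TextExtractor`, `skip`, and `depth` are made up for illustration, and it assumes Python 3's `html.parser`. It takes the set of tags to drop as a parameter and uses a counter instead of a boolean, so a skipped element nested inside another skipped element does not switch text collection back on too early.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring everything inside the tags named in `skip`."""

    def __init__(self, skip=("script", "style")):
        super().__init__(convert_charrefs=True)
        self.skip = frozenset(skip)
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags driving the counter negative
        if tag in self.skip and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_tags(html, skip=("script", "style")):
    parser = TextExtractor(skip)
    parser.feed(html)
    parser.close()  # flush any text the parser is still holding
    return "".join(parser.parts)


sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1 < 2;</script></head>"
    "<body><p>Hello, <b>world</b> &amp; friends</p></body></html>"
)
print(strip_tags(sample))
```

For the sample document this prints the body text with the stylesheet and script removed. Passing something like `skip=("script", "style", "noscript")` would drop further elements, which is where the counter starts to matter.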
    
