How to remove all HTML tags from a downloaded page

鱼传尺愫 2020-12-31 17:32

I have downloaded a page using urlopen. How do I remove all HTML tags from it? Is there a regular expression that replaces all <*> tags?

7 Answers
  • 2020-12-31 18:12

    Try this:

    import re
    
    def remove_html_tags(data):
      # Non-greedy match: everything from '<' up to the nearest '>'
      p = re.compile(r'<.*?>')
      return p.sub('', data)
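
    The pattern above misses a tag that is split across several lines, because `.` does not match a newline by default, and it leaves entities such as `&amp;` undecoded. A minimal sketch of one way to cover both cases, assuming the standard library's `html.unescape` is acceptable for decoding entities (the name `strip_tags_regex` is made up for illustration):

```python
import html
import re

# re.DOTALL lets '.' cross newlines, so tags split over several lines match too.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags_regex(data):
    """Remove anything that looks like a tag, then decode entities."""
    return html.unescape(TAG_RE.sub('', data))

print(strip_tags_regex('<p class="x"\n   id="y">fish &amp; chips</p>'))
# prints: fish & chips
```

    A regex can still be fooled by a literal `>` inside an attribute value or a comment, so a real parser is the safer choice for arbitrary pages.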
    
  • 2020-12-31 18:18

    If you need HTML parsing, Python has a module for you!
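
    The answer does not name the module. Assuming it means the standard library's `html.parser` (called `HTMLParser` in Python 2), a minimal sketch of a tag stripper could look like this; the `TextExtractor` class and `strip_tags` function are hypothetical helpers, not part of the library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; entities arrive already decoded.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('<p>Hello, <b>world</b> &amp; friends</p>'))
# prints: Hello, world & friends
```

    The contents of `<script>` and `<style>` elements are also delivered to `handle_data`, so this sketch keeps them; skipping them would need extra state in `handle_starttag` and `handle_endtag`.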

  • 2020-12-31 18:19

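    There's a great Python library called bleach. The call below removes all HTML tags and leaves everything else in place. Note that it keeps the text inside non-visible elements such as <script> and <style>; only the tags themselves are stripped. Releases before bleach 5.0 also accepted styles=[]; that argument was removed in 5.0.

    bleach.clean(thestring, tags=[], attributes={}, strip=True)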
    
  • 2020-12-31 18:24

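    There are several ways to filter HTML tags out of data. You can use a regex, or remove_tags from w3lib, which is a third-party package rather than part of the Python standard library (install it with pip install w3lib).

    from w3lib.html import remove_tags
    data_to_remove = '<p>hello\t\t, \tworld\n</p>'
    print(remove_tags(data_to_remove))

    OUTPUT: hello\t\t, \tworld\n, that is, the input with only the <p> tags removed; the tabs and the newline are left as they are.

    Note: remove_tags accepts a string object. If your data is not already a string, you can pass remove_tags(str(data_to_remove)).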

  • 2020-12-31 18:29

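    I can also recommend BeautifulSoup, which is an easy-to-use HTML parser. With the current bs4 package you would do something like:

    from bs4 import BeautifulSoup  # BeautifulSoup 4; "from BeautifulSoup import ..." is version 3
    
    # html holds the downloaded page; name the parser explicitly
    soup = BeautifulSoup(html, 'html.parser')
    # Join every text node in the document
    all_text = ''.join(soup.find_all(string=True))
    

    This way you get all the text from an HTML document.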

  • 2020-12-31 18:29

    You could use html2text, which is designed to produce a readable plain-text equivalent of an HTML source (programmatically from Python or as a command-line tool). I am guessing at your needs from the question, though.
