Removing HTML tags from text using regular expressions in Python

感动是毒 2021-01-18 08:55

I'm trying to read an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far.

2 answers
  • 2021-01-18 09:41

    Use BeautifulSoup. Use lxml. Do not use regular expressions to parse HTML.
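
    As a minimal sketch of the parser-based approach, the example below uses the standard library's html.parser as a stand-in for BeautifulSoup or lxml, so it runs without extra packages. The TextExtractor class, the strip_tags helper, and the sample string are hypothetical names made up for illustration. The sample includes a ">" inside an attribute value and a tag-like string inside a script, two cases where a tag-matching regex typically goes wrong.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample input, not taken from the question
sample = '<p title="a > b">Fish &amp; chips</p><script>var x = "<b>";</script>'
print(strip_tags(sample))
```

    This only extracts text. For real pages, BeautifulSoup or lxml cope better with malformed markup and give finer control over what is stripped.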


    Edit 2010-01-29: This would be a reasonable starting point for lxml:

    from lxml.html import fromstring
    # Note: since lxml 5.2 the cleaner lives in the separate lxml_html_clean package
    from lxml.html.clean import Cleaner
    import requests
    
    url = "https://stackoverflow.com/questions/2165943/removing-html-tags-from-a-text-using-regular-expression-in-python"
    html = requests.get(url).text
    
    doc = fromstring(html)
    
    tags = ['h1','h2','h3','h4','h5','h6',
           'div', 'span', 
           'img', 'area', 'map']
    args = {'meta':False, 'safe_attrs_only':False, 'page_structure':False, 
           'scripts':True, 'style':True, 'links':True, 'remove_tags':tags}
    cleaner = Cleaner(**args)
    
    path = '/html/body'
    body = doc.xpath(path)[0]
    
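    # Python 3: print() is a function; drop non-ASCII characters, then decode back to str
    print(cleaner.clean_html(body).text_content().encode('ascii', 'ignore').decode('ascii'))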
    

    You want the content, so presumably you don't want any JavaScript or CSS. Presumably you also want only the content of the body, not the HTML from the head. Read up on lxml.html.clean to see what you can easily strip out. Way smarter than regular expressions, no?

    Also, watch out for Unicode encoding problems. You can easily end up with text that you cannot print.
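
    The encoding problem can be sketched on a small string. The sample text and variable names below are assumptions for illustration. The example compares the "ignore" error handler used in the print line above with "replace" and with strict encoding.

```python
# Hypothetical sample containing non-ASCII characters (accents and an en dash)
text = "Caf\u00e9 \u2013 na\u00efve r\u00e9sum\u00e9"

# errors="ignore" silently drops every character outside ASCII
ascii_ignored = text.encode("ascii", "ignore").decode("ascii")

# errors="replace" leaves a visible "?" for every character it could not encode
ascii_replaced = text.encode("ascii", "replace").decode("ascii")

# Strict encoding (the default) raises when a character cannot be represented
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ascii_ignored)
print(ascii_replaced)
print(strict_failed)
```

    Dropping characters is lossy. If the terminal or output file can handle UTF-8, printing the text unchanged is usually the better option.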


    2012-11-08: changed from using urllib2 to requests. Just use requests!

  • 2021-01-18 09:43
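    import re
    import urllib.request

    # Matches any tag (non-greedy) plus the &nbsp; and &amp; entities
    patjunk = re.compile(r"<.*?>|&nbsp;|&amp;", re.DOTALL | re.M)
    url = "http://www.yahoo.com"

    def test(url, pat):
        # Decode the response bytes so the str pattern can be applied (Python 3)
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        return pat.sub("", html)

    print(test(url, patjunk))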
    