Removing HTML tags from text using regular expressions in Python

感动是毒 2021-01-18 08:55

I'm trying to read an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far.

2 Answers
  •  粉色の甜心
    2021-01-18 09:43

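    import re
    import urllib.request

    # Tags, plus &nbsp; and &amp; (the entity names are an assumption
    # about the intended pattern). DOTALL lets a tag span several lines.
    patjunk = re.compile(r"<.*?>|&nbsp;|&amp;", re.DOTALL)
    url = "http://www.yahoo.com"

    def test(url, pat):
        # Fetch the page and decode the bytes to text before substituting.
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        return pat.sub("", html)

    print(test(url, patjunk))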
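
To try the idea without a network request, here is a minimal offline sketch using the same non-greedy `<.*?>` pattern on an inline string. The names `TAG_RE`, `strip_tags`, and `sample` are made up for illustration. It also decodes entities with `html.unescape` instead of deleting them, which is a deliberate swap from the answer above, not part of it.

```python
import re
from html import unescape

# Non-greedy tag match; DOTALL lets "." cross newlines so a tag may span lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)

def strip_tags(html):
    """Remove tags, decode entities, and collapse leftover whitespace."""
    text = TAG_RE.sub("", html)
    text = unescape(text)          # e.g. &amp; becomes &, &nbsp; becomes U+00A0
    return " ".join(text.split())  # split() also treats U+00A0 as whitespace

sample = '<p>Fish &amp; chips,\n<a href="/menu"\n   class="x">today</a>&nbsp;only</p>'
print(strip_tags(sample))
```

A regex like this keeps the contents of `<script>` and `<style>` elements and can be confused by a `>` inside an attribute value, so it is best suited to simple, well-behaved markup.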
    
