I'm trying to read an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far:
import re
import urllib2

patjunk = re.compile("<.*?>| |&", re.DOTALL | re.M)
url = "http://www.yahoo.com"

def test(url, pat):
    html = urllib2.urlopen(url).read()
    return pat.sub("", html)

print test(url, patjunk)
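To make the behavior easier to reproduce without fetching a live page, here is a minimal sketch of the same idea run on an inline string. It assumes Python 3 (so `html.unescape` is available) rather than the `urllib2`-era code in the question, and the names `TAG_PATTERN`, `strip_tags`, and `sample` are hypothetical, chosen only for illustration. It decodes entities after stripping tags instead of deleting them in the pattern, which is a swapped-in choice and not part of the original code.

```python
import html
import re

# Same non-greedy tag pattern as in the question; DOTALL lets a tag span lines.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(markup):
    """Remove anything that looks like a tag, then decode character entities."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags)


# Hypothetical sample input standing in for a downloaded page.
sample = '<html><body><p class="x">Fish &amp; chips</p>\n<a\nhref="#">menu</a></body></html>'

print(strip_tags(sample))
```

A regex like this only removes the tags themselves, so the contents of `<script>` and `<style>` elements would still be left in the output. For real pages a proper parser such as the standard library's `html.parser` is likely the more robust route.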