Ok, so i\'m working on a regular expression to search out all the header information in a site.
I\'ve compiled the regular expression:
regex = re.compile
Parsing things with regular expressions works for regular languages. HTML is not a regular language, and the stuff you find on web pages these days is absolute crap. BeautifulSoup deals with tag-soup HTML with browser-like heuristics so you get parsed HTML that resembles what a browser would display.
The downside is it's not very fast. There's lxml for parsing well-formed html, but you should really use BeautifulSoup if you're not 100% certain that your input will always be well-formed.