Ok, so I'm working on a regular expression to find all the header information on a site.
I've compiled the regular expression:
regex = re.compile
As has been mentioned, you should use a parser instead of a regex.
This is how you could do it with a regex though:
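import re
# Sample markup: the header levels and href values here are placeholders.
html = '''
<h1>Dog</h1>
<h1><a href="http://example.com/cat">Cat</a></h1>
<h2>Fancy</h2>
<h1>Tall cup of lemons</h1>
<h1><a href="http://example.com/dog">Dog thing</a></h1>
'''
p = re.compile(r'''
    <(?P<header>h[0-9])>            # store header tag for later use
    \s*                             # zero or more whitespace
    (<a\s+href="(?P<href>.*?)">)?   # optional link tag. store href portion
    \s*
    (?P<title>.*?)                  # title
    \s*
    (</a>)?                         # optional closing link tag
    \s*
    </(?P=header)>                  # must match opening header tag
''', re.IGNORECASE | re.VERBOSE)
stories = p.finditer(html)
for match in stories:
    # href is None for headers that do not contain a link
    print('%(title)s [%(href)s]' % match.groupdict())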
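For comparison, here is a rough sketch of the parser route recommended above, using only the standard library's html.parser module. The class name HeaderCollector, the helper extract_headers, and the sample markup are all made up for illustration; a third-party parser would work just as well.

```python
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect (tag, title, href) for every h1-h6 element (hypothetical helper)."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headers = []     # finished (tag, title, href) tuples
        self._current = None  # header tag currently open, if any
        self._text = []       # text fragments seen inside that header
        self._href = None     # href of a link inside that header, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag in self.HEADER_TAGS:
            self._current, self._text, self._href = tag, [], None
        elif tag == "a" and self._current is not None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            title = "".join(self._text).strip()
            self.headers.append((tag, title, self._href))
            self._current = None


def extract_headers(html):
    """Return a list of (tag, title, href) tuples found in the markup."""
    parser = HeaderCollector()
    parser.feed(html)
    parser.close()
    return parser.headers


# Hypothetical sample markup, not taken from any real site.
sample = '''
<h1>Dog</h1>
<h2><a href="http://example.com/cat">Cat</a></h2>
<H3> Tall cup of lemons </H3>
'''

for tag, title, href in extract_headers(sample):
    print('%s: %s [%s]' % (tag, title, href))
```

Unlike the regex, this does not care about attribute order, extra attributes on the link, or a header that spans several lines, which is the main reason a parser tends to be the safer choice.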
Here are a couple of good regular expression resources: