OK, so I'm working on a regular expression to find all the header information in a site.
I've compiled the regular expression:
regex = re.compile
I used BeautifulSoup to parse the HTML you want. I saved the HTML above in a file called foo.html and read it back through a file object.
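from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

H_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

def extract_data():
    """Extract the data from all headers
    in a HTML page."""
    # Read the whole file; the with block closes it afterwards.
    with open('foo.html', 'r') as f:
        html = f.read()
    soup = BeautifulSoup(html)
    # One result list per header tag that actually occurs in the page.
    headers = [soup.findAll(h) for h in H_TAGS if soup.findAll(h)]
    lst = []
    for x in headers:
        for y in x:
            if y.string:
                # The header contains only text.
                lst.append(y.string)
            else:
                # The header has nested markup: take the text of its first child.
                lst.append(y.contents[0].string)
    return lst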
The above function returns:
>>> [u'Dog ', u'Tall cup of lemons', u'Dog thing', u'Cat ', u'Fancy ']
You can add or remove header tags in the H_TAGS list; I have included all six header levels. If you can solve a problem easily with BeautifulSoup, then it's better to use it. :)
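If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser in Python 3. This is a different technique from the function above, not a drop-in replacement: the names HeaderTextParser and extract_headers and the sample HTML are made up for illustration, the results come back in document order rather than grouped by tag, and all text nested inside a header is joined instead of taking only the first child.

```python
from html.parser import HTMLParser

H_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeaderTextParser(HTMLParser):
    """Collect the text inside every h1..h6 element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headers = []   # finished header strings, in document order
        self._depth = 0     # greater than 0 while inside a header element
        self._parts = []    # text fragments of the header being read

    def handle_starttag(self, tag, attrs):
        if tag in H_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in H_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Header closed: join its fragments and store the result.
                self.headers.append(''.join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Keep text only while a header element is open.
        if self._depth:
            self._parts.append(data)


def extract_headers(html):
    """Return the text of all headers found in an HTML string."""
    parser = HeaderTextParser()
    parser.feed(html)
    parser.close()
    return parser.headers


if __name__ == '__main__':
    # Made-up sample input, not the HTML from the question.
    sample = '<h1>Dog <em>thing</em></h1><p>text</p><h3>Cat</h3>'
    print(extract_headers(sample))  # ['Dog thing', 'Cat']
```

Either way, a parser keeps track of nesting and attributes for you, which is the part that tends to break a hand-written regular expression.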