I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.
You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.
>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> val = soup.find('span', {'class':'price'}).text
>>> print(val[1:])
19.99
You can still parse this with BeautifulSoup; you don't need the full HTML:
from bs4 import BeautifulSoup
html="""
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>
"""
soup = BeautifulSoup(html, 'html.parser')
sp = soup.find(attrs={"class":"price"})
print(sp.text[1:])
19.99
The current BeautifulSoup answers only show how to grab all <span class="price"> tags. It is better to start from the HD Version item and walk back to its price:
from bs4 import BeautifulSoup
html = """<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>"""
soup = BeautifulSoup(html, 'html.parser')
# Find the "HD Version" <li>, then walk back from its parent <ul> to the price <span>
for hd_version in (tag for tag in soup('li') if tag.text.lower() == 'hd version'):
    price = hd_version.parent.find_previous_sibling('span', attrs={'class': 'price'}).text
In general, using regular expressions to parse a non-regular language like HTML is asking for trouble. Stick with an established parser.
BeautifulSoup is very lenient about the HTML it parses, so you can use it on fragments of HTML too:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('span', class_='price').text[1:])
Prints:
19.99
You can use this regex:
\d+(?:\.\d+)?(?=\D+HD Version)
\D+ skips over non-digits in a lookahead, effectively asserting that our match (19.99) is the last number before HD Version. Here is a regex demo.
Use the i modifier in the regex to make the matching case-insensitive, and change + to * if the number can be directly before HD Version.
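For completeness, here is a minimal sketch of applying that regex from Python with the standard re module. The variable names (html, pattern, hd_price) are only illustrative, and re.IGNORECASE stands in for the i modifier mentioned above. It assumes the price is the last number before HD Version in the fragment.

```python
import re

# Same fragment as in the question.
html = """<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>"""

# A number with an optional decimal part, followed only by
# non-digit characters up to the text "HD Version".
pattern = re.compile(r"\d+(?:\.\d+)?(?=\D+HD Version)", re.IGNORECASE)

match = pattern.search(html)
# search() returns None when nothing matches, so guard before reading the group.
hd_price = match.group(0) if match else None
print(hd_price)  # 19.99
```

If another number (for example a second price) can appear between the HD price and the HD Version text, this pattern picks that later number instead, so check it against your real fragments.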