I have a string:
<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>
(it outputs over two lines, so there must
You have a bunch of options here. You could go for an all-out xml parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:
import re
rex = re.compile(r'<font.*?>(.*?)</font>', re.S | re.M)
...
data = """<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>"""
# search() finds the tag anywhere in the string; match() only tries position 0
match = rex.search(data)
if match:
    text = match.group(1).strip()
Now that you have text, you can turn it into a date pretty easily:
from datetime import datetime
date = datetime.strptime(text, "%b %d")
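Putting the two steps above together, here is a minimal sketch rather than a definitive implementation. The names FONT_RE and extract_date, and the year argument, are assumptions made for illustration: the string carries no year, so strptime with "%b %d" alone falls back to 1900, and the caller has to supply the real one.

```python
import re
from datetime import datetime

# Hypothetical helper names; the pattern follows the regex in the answer above.
FONT_RE = re.compile(r"<font[^>]*>(.*?)</font>", re.S | re.I)

def extract_date(html, year):
    """Return the date inside the first <font> tag, or None if there is none."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    text = match.group(1).strip()  # "JUL 28"
    # Append the caller-supplied year so the result is not stuck in 1900.
    return datetime.strptime("%s %d" % (text, year), "%b %d %Y")

data = """<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>"""
print(extract_date(data, 2011))
```

If the text inside the tag is not in "MON DD" form, strptime raises ValueError, so wrap the call in try/except when the input is not trusted.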
Is grep an option?
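grep -E "<[^>]*>(.*)</[^>]*>" file
The (.*) should match your content. Note that grep works line by line, so this only helps when the opening and closing tags sit on the same line.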
Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html
Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.
http://pypi.python.org/pypi/BeautifulSoup/3.2.0
While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.
>>> from BeautifulSoup import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">
... JUL 28 </font>"""
... )
>>> BS.font.contents[0].strip()
u'JUL 28'
Then you just need to parse the date:
>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
Python's standard library includes a module called HTMLParser. See also the following Stack Overflow question, which is very similar to what you are looking for:
How can I use the python HTMLParser library to extract data from a specific div tag?
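As a rough sketch of the HTMLParser approach, assuming Python 3, where the module lives at html.parser (in Python 2 it was the top-level HTMLParser module). The class name FontTextParser and the helper font_text are made up for this example.

```python
from html.parser import HTMLParser

class FontTextParser(HTMLParser):
    """Collects the text found inside <font> ... </font> tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # how many <font> tags are currently open
        self.chunks = []   # text pieces seen while inside a <font> tag

    def handle_starttag(self, tag, attrs):
        if tag == "font":  # tag names arrive lowercased
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)

def font_text(html):
    parser = FontTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

data = """<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>"""
print(font_text(data))
```

The stripped string can then be passed to datetime.strptime with the "%b %d" format, as in the other answers.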
Or, you could simply use Beautiful Soup:
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping