Python string operation, extract text between html tags


I have a string:

  
JUL 28

(it outputs over two lines, so there must


6 answers

长情又很酷

2020-12-03 13:15
You have a bunch of options here. You could go for a full XML parser such as lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:
```
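import re

# re.S lets "." match newlines, so the tag's text may span several lines.
rex = re.compile(r'<font.*?>(.*?)</font>', re.S)
...
# Assumed sample: the original markup was lost, so a bare <font> tag stands in.
data = """<font>
JUL 28 </font>"""

match = rex.match(data)
if match:
    # Group 1 is the text between the opening and closing tags.
    text = match.group(1).strip()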
```
Now that you have text, you can turn it into a date pretty easily:
```
from datetime import datetime
date = datetime.strptime(text, "%b %d")
```
暖寄归人

2020-12-03 13:15
Is grep an option?
```
grep -E "<[^>]*>(.*)</[^>]*>" file
```
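The (.*) group should match your content, as long as the opening and closing tags sit on the same line, since grep matches line by line.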
野的像风

2020-12-03 13:19

Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html

Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.

http://pypi.python.org/pypi/BeautifulSoup/3.2.0
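
To give a feel for XPath-style selection without installing anything, here is a minimal sketch that swaps in the standard library's `xml.etree.ElementTree` (not Scrapy's selectors). It supports only a limited XPath subset and needs well-formed markup. The `fragment` string and the `font_texts` helper are assumptions made up for illustration, since the original markup is not shown in full.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the real page.
fragment = '<td><font size="-2">\nJUL 28 </font></td>'


def font_texts(markup):
    """Return the stripped text of every <font> element below the root."""
    root = ET.fromstring(markup)
    # ".//font" is XPath-style: any <font> descendant of the root element.
    return [(el.text or "").strip() for el in root.findall(".//font")]


print(font_texts(fragment))
```

For real-world pages that are not well-formed, a forgiving HTML parser is the safer choice.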

野性不改

2020-12-03 13:23
While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.
```
>>> from BeautifulSoup import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font>
... JUL 28 </font>"""
... )
>>> BS.font.contents[0].strip()
u'JUL 28'
```
Then you just need to parse the date:
```
>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
```
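
Note that the parsed date lands in the year 1900, because the text carries no year. If you know the year from elsewhere, one option is to parse it together with the month and day. This is a rough sketch: the `parse_month_day` helper and the year 2011 are assumptions for illustration, not something given in the question.

```python
from datetime import datetime


def parse_month_day(text, year):
    """Parse text such as 'JUL 28' into a datetime in the given year."""
    # Parsing the year along with the month and day avoids the 1900 default
    # and lets a date like 'FEB 29' be validated against the actual year.
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")


print(parse_month_day("JUL 28", 2011))
```

The `%b` directive depends on the current locale, so this assumes English month abbreviations.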
栀梦

2020-12-03 13:23

Python's standard library includes a module called HTMLParser (html.parser in Python 3). Also see the following question on Stack Overflow, which is very similar to what you are looking for:

How can I use the python HTMLParser library to extract data from a specific div tag?
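
As a rough idea of what that looks like for this case, here is a minimal Python 3 sketch using `html.parser.HTMLParser`. The class name `FontTextParser`, the helper `extract_font_text`, and the sample markup are all assumptions for illustration; the question's real markup is not shown in full.

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collect the text found inside <font> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # number of <font> tags currently open
        self.chunks = []   # stripped text pieces found inside them

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <FONT> is handled too.
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that appears while a <font> is open.
        if self._depth and data.strip():
            self.chunks.append(data.strip())


def extract_font_text(html):
    parser = FontTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Assumed sample markup.
sample = '<td><font size="-2">\nJUL 28 </font></td>'
print(extract_font_text(sample))
```

Unlike a regex, this keeps working when the tag has extra attributes or the text spans several lines.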

慢半拍i

2020-12-03 13:35

Or, you could simply use Beautiful Soup:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping
