How do I ensure that re.findall() stops at the right place?

Posted by 放肆的年华 on 2019-12-08 16:51:38

Question


Here is the code I have:

a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)

The result is:

[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]

If I ever wrote a crawler to collect the titles of web sites, I might end up with something like this rather than a single title.

My question is, how do I limit findall to a single <title></title>?


Answer 1:


Use re.search instead of re.findall if you only want one match:

>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
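
One caveat with the snippet above: re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on input that has no title. Here is a minimal sketch of a guarded version. The helper name first_title is hypothetical, and the re.IGNORECASE and re.DOTALL flags are an assumption about what real pages may need (uppercase tags, titles spanning several lines).

```python
import re

# Hypothetical helper, not part of the original answer.
# Assumption: tags may be uppercase and a title may span several lines.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def first_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()

s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
print(first_title(s))                     # aaa
print(first_title('<p>no title here</p>'))  # None
```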

If you want all of the titles, make the quantifier non-greedy (i.e. .*?):

print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']

That said, for parsing HTML you should seriously consider a real parser such as BeautifulSoup or lxml.




Answer 2:


Use a non-greedy search instead:

r'<(title)>(.*?)<(/title)>'

The question mark after the * tells the quantifier to match as few characters as possible, so each match stops at the first </title> it reaches. Now your findall() will return each of the results you want.

http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
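
To make the difference concrete, here is a small side-by-side sketch run on the string from the question. The third pattern, a negated character class, is an alternative that is not in the original answers; it assumes the title text never contains a '<' character.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Greedy: .* consumes the rest of the string, then backtracks to the LAST </title>.
greedy = re.findall(r'<title>(.*)</title>', a)

# Non-greedy: .*? grows one character at a time and stops at the FIRST </title>.
lazy = re.findall(r'<title>(.*?)</title>', a)

# Alternative (assumption: no '<' inside the title text): match anything but '<'.
negated = re.findall(r'<title>([^<]*)</title>', a)

print(greedy)   # ['aaa</title><title>aaa2</title><title>aaa3']
print(lazy)     # ['aaa', 'aaa2', 'aaa3']
print(negated)  # ['aaa', 'aaa2', 'aaa3']
```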




Answer 3:


re.findall(r'<(title)>(.*?)<(/title)>', a)

Add a ? after the *, so it will be non-greedy.
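
Note that with three capture groups, as in the pattern above, findall returns a list of tuples rather than a list of strings. The sketch below shows both shapes; dropping the groups around the tag names is a suggestion for illustration, not something the original answer states.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Three groups: each match comes back as a 3-tuple (open tag, text, close tag).
three_groups = re.findall(r'<(title)>(.*?)<(/title)>', a)

# One group: each match comes back as just the captured text.
one_group = re.findall(r'<title>(.*?)</title>', a)

print(three_groups)
# [('title', 'aaa', '/title'), ('title', 'aaa2', '/title'), ('title', 'aaa3', '/title')]
print(one_group)
# ['aaa', 'aaa2', 'aaa3']
```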




Answer 4:


This is much easier with the BeautifulSoup module.

https://pypi.python.org/pypi/beautifulsoup4
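
As a rough illustration of the parser-based approach, here is a minimal sketch. It deliberately uses Python 3's standard-library html.parser as a stand-in rather than BeautifulSoup itself, so that it runs without installing anything. The names TitleCollector and extract_titles are hypothetical, chosen for this example only.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag == 'title':
            self._in_title = True
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so buffer it until the end tag.
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self.titles.append(''.join(self._chunks))
            self._in_title = False

def extract_titles(html):
    """Return the text of all <title> elements found in html."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
print(extract_titles(a))  # ['aaa', 'aaa2', 'aaa3']
```

Unlike the regular expression, a parser handles the tag boundaries itself, so greedy versus non-greedy matching is no longer a concern.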



Source: https://stackoverflow.com/questions/17765805/how-do-i-ensure-that-re-findall-stops-at-the-right-place
