Python re - escape coincidental parentheses in regex pattern

Posted by 流过昼夜 on 2019-11-28 12:31:31

Question


I am having trouble with the regex in the following code:

import mechanize
import re

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1")

html = response.read()
br.select_form(nr=0)
#print br.form
br.set_all_readonly(False)
next = re.search(r"""<a href="javascript:__doPostBack('(.*?)','(.*?)')">""",html)

if next:
    print 'group(1):', next.group(1)
    print 'group(2):', next.group(2) 

If the single quotes around both instances of (.*?) are removed from the regex, these are the results:

group(1): ('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee','Page$2')
group(2): ('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee'

These results are not quite right. The parentheses and single quotes need to be removed (not my question) and I would like group(1) and group(2) to look like this:

group(1): ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee
group(2): Page$2

Answer 1:


You need to escape the literal parentheses, since unescaped parentheses have a special meaning in a regex (they delimit capture groups):

<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">
                             HERE^            HERE^
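
To check the escaped pattern without hitting the live site, here is a minimal sketch in Python 3 syntax. The sample_html string is an assumption, modelled on the values shown in the question; the real page may be laid out differently.

```python
import re

# Hypothetical sample modelled on the values shown in the question;
# the markup of the real page may differ.
sample_html = (
    '<a href="javascript:__doPostBack('
    "'ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee',"
    "'Page$2')\">2</a>"
)

# \( and \) match literal parentheses; the unescaped ( ) pairs
# are the two capture groups.
pattern = re.compile(r"""<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">""")

match = pattern.search(sample_html)
if match:
    print('group(1):', match.group(1))
    print('group(2):', match.group(2))
```

With the parentheses escaped, the single quotes can stay in the pattern, so neither they nor the parentheses end up inside the captured groups.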

Note that, ideally, you should not be parsing HTML with regex (even though your pattern is quite specific and I don't think this is that bad). Instead, parse HTML with, say, BeautifulSoup, locate the a element, get the href attribute value and then extract the desired substrings with regex.
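
The parse-first approach could look roughly like the sketch below. To stay dependency-free it uses the standard library's html.parser in place of BeautifulSoup, so treat it as an illustration of the idea rather than the exact method suggested above. The HrefCollector class and the sample_html string are hypothetical names introduced for this example.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample modelled on the values shown in the question.
sample_html = (
    '<a href="javascript:__doPostBack('
    "'ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee',"
    "'Page$2')\">2</a>"
)

collector = HrefCollector()
collector.feed(sample_html)

# The regex now only has to understand the href value, not the HTML around it.
postback = re.compile(r"__doPostBack\('(.*?)','(.*?)'\)")
targets = [m.groups() for m in map(postback.search, collector.hrefs) if m]
print(targets)
```

Because the parser isolates the attribute value first, the regex no longer depends on attribute order or spacing inside the tag.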



Source: https://stackoverflow.com/questions/39254333/python-re-escape-coincidental-parentheses-in-regex-pattern
