Strange behavior of parentheses in Python regex

Submitted by 时光怂恿深爱的人放手 on 2019-12-02 17:04:28

Question


I'm writing a Python regex that looks through a text document for quoted strings (quotes of airline pilots recorded from black boxes). I started by trying to write a regex with the following rules:

Return what is between quotes.
If it opens with a single quote, only return it if it closes with a single quote.
If it opens with a double quote, only return it if it closes with a double quote.

For instance, I don't want to match "hi there' or 'hi there", but I do want to match "hi there" and 'hi there'.

I use a testing page which contains things like:

CA  "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA  "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"

So I decided to start simple:

 re.findall('("|\').*?\\1', page)
 ########## /("|').*?\1/ <-- raw regex I think I'm going for.

This regex acts very unexpectedly.
I thought it would:

  1. ( " | ' ) Match EITHER single OR double quotes, save as back reference \1.
  2. .*? Match non-greedy wildcard.
  3. \1 Match whatever it finds in back reference \1 (step one).

Instead, it returns a list containing only the quote characters, never the text between them.

['"', '"', "'", "'"]

I'm really confused because the equivalent (afaik) regex works just fine in Vim.

\("\|'\).\{-}\1/)

My question is this:
Why does it return only what is inside the parentheses as the match? Is this a flaw in my understanding of back references? If so, then why does it work in Vim?

And how do I write the regex I'm looking for in Python?

Thank you for your help!


Answer 1:


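Read the documentation. re.findall returns the captured groups, if the pattern contains any, rather than the whole match. If you want the entire match, you must either wrap the whole pattern in a group or use re.finditer and read group(0) of each match. See this question.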
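
To make the difference concrete, here is a minimal sketch that runs the question's pattern over three of the sample lines from the question. The variable names (page, groups_only, whole_matches) are only illustrative.

```python
import re

# Three lines copied from the test page shown in the question.
page = '''CA  "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA  "Roger that"
ST: "Some passenger's pushing a switch. May I?"
'''

pattern = r'("|\').*?\1'

# With exactly one capturing group, findall returns only that group's text,
# which here is just the opening quote character.
groups_only = re.findall(pattern, page)

# finditer yields match objects; group(0) is always the entire match.
whole_matches = [m.group(0) for m in re.finditer(pattern, page)]

print(groups_only)
print(whole_matches)
```

Note that the apostrophe in "passenger's" does not end the third match, because the back reference requires the same quote character that opened it.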




Answer 2:


You aren't capturing anything except for the quotes, which is what Python is returning.

If you add another group, things work much better:

for quote, match in re.findall(r'("|\')(.*?)\1', page):
  print(match)

I prefixed your string literal with an r to make it a raw string, which is useful when you need to use a ton of backslashes (\\1 becomes \1).
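
As a small illustration of the raw-string point, the sketch below compares the escaped and raw spellings of the pattern. The sample string and variable names are made up for the example.

```python
import re

# In a normal string literal, '\\1' is two characters: a backslash and '1'.
# A raw string spells the same two characters as r'\1'.
assert '\\1' == r'\1'

# Without the doubled backslash or the r prefix, '\1' is the single control
# character U+0001, so the regex engine never sees a back reference.
assert '\1' == chr(1)

escaped = '("|\').*?\\1'   # the question's spelling
raw = r'("|\').*?\1'       # the raw-string spelling

# Hypothetical sample text containing one double-quoted and one
# single-quoted phrase.
sample = 'say "hi there" or \'hi there\''

escaped_result = re.findall(escaped, sample)
raw_result = re.findall(raw, sample)
print(escaped_result, raw_result)
```

The two Python strings are not character-for-character identical (the raw one keeps the backslash in front of the single quote), but the regex engine treats \' as a literal quote, so both patterns match the same text.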




Answer 3:


You need to catch everything with an extra pair of parentheses.

re.findall('(("|\').*?\\2)', page)
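
A minimal sketch of what this returns, using the two acceptable examples from the question as input (the variable names are illustrative). Because the pattern now has two groups, findall returns a tuple per match, and the back reference becomes \2 since the quote is now the second group.

```python
import re

# The two well-formed examples from the question.
text = "\"hi there\" and 'hi there'"

# Group 1 is the whole quoted string, group 2 is the opening quote.
pattern = r'(("|\').*?\2)'

# One (whole_match, quote_char) tuple per match.
pairs = re.findall(pattern, text)

# Keep only the whole quoted strings.
whole = [full for full, _quote in pairs]

print(pairs)
print(whole)
```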


Source: https://stackoverflow.com/questions/11703573/strange-behavior-of-parenthesis-in-python-regex
