Using Python to split a string on a delimiter, while ignoring delimiters and escaped quotes inside quotes

Question

I am trying to split a string based on the location of a delimiter (I am trying to remove comments from Fortran code). I can split using ! in the following string:

import re

x = '''print "hi!" ! Remove me'''
pattern = '''(?:[^!"]|"[^"]*")+'''
y = re.search(pattern, x)

However, this fails if the string contains escaped quotes, e.g.

z = r'''print "h\"i!" ! Remove me'''

Can the regex be modified to handle escaped quotes? Or should I not even be using regexps for this sort of problem?

Answer 1:

Here's a proven regex (from Mastering Regular Expressions) for matching double-quoted string literals which may contain backslash-escaped quotes:

r'"[^"\\]*(?:\\.[^"\\]*)*"'

Within the delimiting quotes, it consumes any pair of characters that starts with a backslash without bothering to identify the second character; that allows it to handle escaped backslashes and other escape sequences with no extra hassle. It's also as efficient as can be in the absence of possessive quantifiers and atomic groups, which Python's re module did not support until version 3.11.
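To see the "consume a backslash pair" behaviour concretely, here is a small sketch that runs the string-literal regex on three sample lines. The sample strings and the names STRING_LITERAL and samples are made up for illustration.

```python
import re

# The string-literal regex quoted above, unchanged.
STRING_LITERAL = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Hypothetical sample lines, chosen only for illustration.
samples = [
    r'print "hi!"',              # plain literal
    r'print "h\"i!" rest',       # escaped quote inside the literal
    r'print "ends with \\" x',   # escaped backslash just before the closing quote
]

# Extract the first string literal found on each line.
literals = [STRING_LITERAL.search(s).group(0) for s in samples]
for lit in literals:
    print(lit)
```

In the third sample the two backslashes are consumed together as one pair, so the quote that follows them is correctly treated as the end of the literal.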

The full regex for your application would be:

r'^((?:[^!"]+|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$'

This matches only lines that contain comments, and captures everything preceding the comment in group #1. The capture may be zero-length, for lines that start with !. This regex is intended for use with sub rather than search, as shown here:

import re

pattern = r'^((?:[^!"]+|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$'

x = '''print "hi!" ! Remove me'''
y = re.sub(pattern, r'\1', x)
print(y)
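A slightly larger sketch of the same idea, applied to several kinds of line. One caveat, as far as I can tell: because [^!"]+ sits inside a (...)* group, a long line with no ! at all can be split up in many different ways, and the regex engine may try all of them before giving up. The UNROLLED pattern below is my own variant, not part of the answer; it applies the same unrolling trick to the outer loop so each iteration has to start at a quote. The names ORIGINAL, UNROLLED and strip_comment are hypothetical.

```python
import re

# Pattern from the answer, unchanged.
ORIGINAL = re.compile(r'^((?:[^!"]+|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$')

# Assumed variant (not from the answer): ordinary characters are matched by a
# single [^!"]* run, and every loop iteration must begin with a string literal.
UNROLLED = re.compile(r'^([^!"]*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^!"]*)*)!.*$')

def strip_comment(line, pattern=UNROLLED):
    """Hypothetical helper: remove a trailing '!' comment from a single line."""
    return pattern.sub(r'\1', line)

# Short illustrative lines; both patterns should agree on all of them.
samples = [
    r'print "hi!" ! Remove me',     # comment after a string containing '!'
    r'print "h\"i!" ! Remove me',   # escaped quote inside the string
    '! whole-line comment',         # capture is empty
    'x = 1',                        # no comment: line is left untouched
    'print "no comment!"',          # '!' only inside a string: left untouched
]

for s in samples:
    assert strip_comment(s, ORIGINAL) == strip_comment(s, UNROLLED)
    print(repr(strip_comment(s)))
```

Note that the whitespace before the ! is part of the capture, so you may want to call rstrip() on the result.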


DISCLAIMER: This answer is not about FORTRAN, only about code that follows the rules specified in the question. I've never worked with FORTRAN, and every reference I've found in the last hour or so seems to describe a completely different language. Meh!

Answer 2:

Fortran parsing is actually quite tricky (see e.g. a thread here). I am blissfully unfamiliar with the details of the syntax, and where '!' might occur. So here is a thought: how likely is it that the comments themselves include '!'? If it is not very likely, you might simply remove everything after the last '!' in each line:

def cleanup(line):
    splitlist = line.split("!")
    if len(splitlist) > 1 and "\"" not in splitlist[-1]:
        return '!'.join(splitlist[:-1]).strip()
    else:
        return line

This is not perfect, but at worst you will end up leaving some partial comments. As long as string literals use double quotes, it should not alter actual code; a '!' inside a single-quoted string is not protected.
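Here is a quick sketch of how the heuristic behaves on a few lines, including the cases it does not cover. The function is repeated so the snippet runs on its own; the sample lines are invented for illustration.

```python
def cleanup(line):
    # Same heuristic as above: cut at the last '!' unless a double quote follows it.
    parts = line.split("!")
    if len(parts) > 1 and '"' not in parts[-1]:
        return "!".join(parts[:-1]).strip()
    return line

# Comment after a double-quoted string: removed as intended.
print(repr(cleanup('print "hi!" ! Remove me')))

# Comment that itself contains '!': only the tail is removed.
print(repr(cleanup('x = 1 ! first ! second')))

# No comment, '!' inside double quotes: line is returned unchanged.
print(repr(cleanup('print "hi!"')))

# No comment, '!' inside single quotes: the heuristic cuts into the code.
print(repr(cleanup("print 'hi!'")))
```

The last case is the one to watch for if the code you are cleaning uses single-quoted strings.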

Edit:

Looks like NumPy includes a Python-based Fortran parser in the f2py package. Depending on licensing constraints, you may be able to rework that to reliably parse 'code but not comments.'

Answer 3:

What you need is a negative lookbehind assertion: (?<!...).

For example:

import re

z = r'''print "h\"i!" ! Remove me'''
pattern = r'''(?:[^!"]|(?<!\\)".*(?<!\\)")+'''
y = re.search(pattern, z)

print(y.group(0))

Output:

print "h\"i!"

As pointed out in the comments, the expression above will not handle escaped backslashes. Also it will not handle single quotes which are allowed in FORTRAN. This one should work for those cases as well (I think):

pattern = r'''(?:[^!"']|((?<!\\)"|(\\\\)+").*?((?<!\\)"|(\\\\)+")|((?<!\\)'|(\\\\)+').*?((?<!\\)'|(\\\\)+'))+'''

This is getting a little ugly . . .
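For comparison, the same rules can be written without lookbehinds as a short character scanner. This is only a sketch under the question's assumptions (backslash escapes inside quotes, ! starts a comment), with single quotes treated the same way as double quotes; strip_comment_scan is a hypothetical name, and none of this claims to follow real Fortran syntax.

```python
def strip_comment_scan(line):
    """Hypothetical helper: return the text before the first '!' outside quotes."""
    quote = None              # quote character that opened the current literal, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2        # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # closing quote: back to ordinary code
        elif ch in "\"'":
            quote = ch        # opening quote: remember which kind
        elif ch == "!":
            return line[:i]   # first '!' outside any literal starts the comment
        i += 1
    return line               # no comment found


print(repr(strip_comment_scan(r'print "h\"i!" ! Remove me')))
print(repr(strip_comment_scan(r"print 'a\\' ! Remove me")))
print(repr(strip_comment_scan('print "it\'s!" ! Remove me')))
```

Each character is looked at once, so there is no backtracking to worry about, and adding another rule is a matter of adding a branch.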

Source: https://stackoverflow.com/questions/5150398/using-python-to-split-a-string-with-delimiter-while-ignoring-the-delimiter-and

Tags

python

regex

delimiter