I would like a Python regular expression that matches a given word that's not between single quotes. I've tried to use the (?! ...) negative lookahead,
but without success.
The regex solution below will work in most cases, but it might break if unbalanced single quotes appear outside of string literals, e.g. in comments.
A usual regex trick for matching strings in context is to match and capture what you need to keep, and to just match what you need to replace.
Here is a sample Python demo:
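import re

# Group 1 captures a single-quoted literal (backslash escapes allowed);
# the second alternative matches the whole word substituted for {0}.
rx = r"('[^'\\]*(?:\\.[^'\\]*)*')|\b{0}\b"
s = r"""
var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe"""
toReplace = "foe"
# Put quoted literals back unchanged; replace the word everywhere else.
# re.escape guards against regex metacharacters in the word.
res = re.sub(rx.format(re.escape(toReplace)), lambda m: m.group(1) if m.group(1) else 'NEWORD', s)
print(res)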
The regex will look like
('[^'\\]*(?:\\.[^'\\]*)*')|\bfoe\b
The ('[^'\\]*(?:\\.[^'\\]*)*') part captures single-quoted string literals into Group 1; if it matches, the literal is just put back into the result unchanged. The \bfoe\b part matches the whole word foe in any other context, and that match is subsequently replaced with another word.
NOTE: To also match double-quoted string literals, use r"('[^'\\]*(?:\\.[^'\\]*)*'|\"[^\"\\]*(?:\\.[^\"\\]*)*\")" as the capturing group in place of the single-quote-only one.
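A minimal sketch of the combined pattern wrapped in a helper. The name replace_outside_quotes and its parameters are hypothetical, and re.escape is added on the assumption that the word may contain regex metacharacters:

```python
import re

# Group 1: a single- or double-quoted literal, allowing backslash escapes.
QUOTED = r"('[^'\\]*(?:\\.[^'\\]*)*'|\"[^\"\\]*(?:\\.[^\"\\]*)*\")"


def replace_outside_quotes(text, word, replacement):
    """Replace whole-word `word` everywhere except inside quoted literals."""
    pattern = re.compile(QUOTED + r"|\b" + re.escape(word) + r"\b")
    # A quoted literal (group 1) is returned unchanged; a bare word is replaced.
    return pattern.sub(lambda m: m.group(1) if m.group(1) else replacement, text)


sample = r"""foe = "foe";
foe = ' I\'m your father, foe ! '
bar = thingy + foe"""
print(replace_outside_quotes(sample, "foe", "NEWORD"))
```

Only the occurrences of foe outside both kinds of quotes should change; the same caveat about unbalanced quotes outside literals still applies.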
Capture group 1 of the following regular expression will contain matches of 'foe'.
r"^(?:[^'\n]|\\')*(?:(?<!\\)'(?:[^'\n]|\\')*(?:(?<!\\)')(?:[^'\n]|\\')*)*\b(foe)\b"
Python's regex engine performs the following operations.
^ : assert beginning of string
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
) : end non-capture group
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
) : end non-capture group
* : execute non-capture group 0+ times
\b(foe)\b : match 'foe' in capture group 1
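To try the pattern, here is a small sketch. It assumes the re.MULTILINE flag (not stated above) so that ^ can anchor at the start of every line. Under that assumption the pattern reports at most one foe per line, the last one outside quotes, because each match is anchored to the line start and the leading groups are greedy.

```python
import re

# The pattern from the breakdown above; re.MULTILINE is an assumption made
# here so that ^ matches at the start of every line.
pattern = re.compile(
    r"^(?:[^'\n]|\\')*(?:(?<!\\)'(?:[^'\n]|\\')*(?:(?<!\\)')(?:[^'\n]|\\')*)*\b(foe)\b",
    re.MULTILINE,
)

s = r"""var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe"""

# Offsets of every group-1 match.
hits = [m.start(1) for m in pattern.finditer(s)]
# Offset of the foe that sits inside the single-quoted literal.
quoted_foe = s.index("father, foe") + len("father, ")
print(len(hits), quoted_foe in hits)
```

If every occurrence on a line is needed, the capture-and-skip approach with re.sub or re.finditer is probably the simpler route.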
How about this regular expression:
>>> import re
>>> s = r'''var foe = 10;
foe = "";
dark_vador = 'bad guy'
' I\m your father, foe ! '
bar = thingy + foe'''
>>>
>>> re.findall(r'(?!\'.*)foe(?!.*\')', s)
['foe', 'foe', 'foe']
The trick here is to make sure the expression does not match a foe that has a leading and a trailing ', and to remember to account for the characters in between, hence the .* in the expression.
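A quick sketch of how the lookahead behaves, with illustrative sample lines that are not from the question. It keeps only the trailing lookahead, since the leading (?!\'.*) is checked at the f of foe and so cannot fail. The trailing one only asks whether a ' occurs later on the same line, so a foe that merely precedes a quoted string is skipped as well:

```python
import re

# Only the trailing lookahead: reject foe when a single quote follows it
# anywhere later on the same line ('.' does not cross newlines by default).
pattern = re.compile(r"foe(?!.*')")

unquoted = pattern.findall("bar = thingy + foe")
quoted = pattern.findall("' your father, foe ! '")
before_literal = pattern.findall("foe = 'bad guy'")
print(unquoted, quoted, before_literal)
```

The last case is a false negative, which is worth keeping in mind if assignments of string literals to the word are common in the input.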
((?!\'[\w\s]*[\\']*[\w\s]*)foe(?![\w\s]*[\\']*[\w\s]*\'))
You can try this:
((?!\'[\w\s]*)foe(?![\w\s]*\'))