Problem:
I have thousands of documents which contain a specific character I don't want, e.g. the character a. These documents contain a variety of characters, but
If you can use Visual Studio (instead of Visual Studio Code), it is written in C++ and C# and uses the .NET Framework regular expressions, which means you can use variable length lookbehinds to accomplish this.
(?<="[^"\n]*)a(?=[^"\n]*")
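To see what this first pattern does and does not catch, here is a rough Python sketch. Python's built-in re module rejects variable-length lookbehinds, so the sketch emulates the two lookarounds with plain string checks instead of running the regex itself. The function name naive_matches and the single-line input are assumptions made for illustration only.

```python
from typing import List


def naive_matches(line: str, target: str = "a") -> List[int]:
    # Emulates the lookbehind/lookahead pair for one line without newlines:
    # report every `target` that has a double quote before AND after it.
    hits = []
    for i, ch in enumerate(line):
        if ch != target:
            continue
        quote_before = '"' in line[:i]      # lookbehind: a quote, then non-quotes
        quote_after = '"' in line[i + 1:]   # lookahead: non-quotes, then a quote
        if quote_before and quote_after:
            hits.append(i)
    return hits


print(naive_matches('"a" a "a"'))  # [1, 4, 7]
```

Note that index 4, the a sitting between the two quoted strings, is reported too, because it still has a quote on each side.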
Adding some more logic to the above regular expression, we can tell it to ignore any location that has an even number of " preceding it. This prevents matches for a outside of quotes. Take, for example, the string "a" a "a". Only the first and last a in this string will be matched; the one in the middle will be ignored.
(?<!^[^"\n]*(?:(?:"[^"\n]*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
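The even-number-of-quotes rule can be sketched in Python as a small scanner. This is an emulation under assumptions, not the regex itself (Python's built-in re module has no variable-length lookbehind): the hypothetical function quoted_matches toggles a flag on every double quote and only reports the target while an odd number of quotes precedes it and a closing quote still follows on the line.

```python
from typing import List


def quoted_matches(line: str, target: str = "a") -> List[int]:
    hits = []
    inside = False  # True while an odd number of quotes precedes the position
    for i, ch in enumerate(line):
        if ch == '"':
            inside = not inside
        elif ch == target and inside and '"' in line[i + 1:]:
            # Odd quote count before, and a quote still ahead on the line.
            hits.append(i)
    return hits


print(quoted_matches('"a" a "a"'))  # [1, 7]
```

The middle a at index 4 is now skipped because two quotes precede it.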
Now the only problem is that this will break if we have an escaped " between two double quotes, such as "a\"" a "a". We need to add more logic to prevent this behaviour. Luckily, this beautiful answer exists for properly matching escaped ". Adding this logic to the regex above, we get the following:
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
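If you need the same effect outside Visual Studio, one option is to swap in a different technique: match each whole quoted string with the escape-aware body (?:[^"\\\n]|\\.)* and edit the match inside a callback. The Python sketch below does that with the standard re module. The name strip_inside_quotes, and the choice to delete the character rather than replace it with something else, are assumptions for illustration.

```python
import re

# A double-quoted string whose body is either a character that is not a quote,
# backslash, or newline, or a backslash followed by any character.
QUOTED = re.compile(r'"(?:[^"\\\n]|\\.)*"')


def strip_inside_quotes(text: str, target: str = "a") -> str:
    # Rewrite each quoted string on its own; text outside quotes is untouched.
    return QUOTED.sub(lambda m: m.group(0).replace(target, ""), text)


print(strip_inside_quotes(r'"a\"" a "a"'))  # "\"" a ""
```

Because the escaped quote is consumed as part of the first string, the a between the two strings is left alone. Behaviour on unbalanced quotes may differ from the lookbehind version.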
I'm not sure which method works best with your strings, but I'll explain this last regex in detail as it also explains the two previous ones.
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+) : Negative lookbehind ensuring what precedes doesn't match the following
    ^ : Assert position at the start of the line
    [^"\n]* : Match anything except " or \n any number of times
    (?:(?:"(?:[^"\\\n]|\\.)*){2})+ : Match the following one or more times. This ensures that any " preceding the match are balanced, in the sense that each opening double quote has a closing one.
        (?:"(?:[^"\\\n]|\\.)*){2} : Match the following exactly twice
            " : Match this literally
            (?:[^"\\\n]|\\.)* : Match either of the following any number of times
                [^"\\\n] : Match anything except ", \ and \n
                \\. : Match \ followed by any character
(?<="[^"\n]*) : Positive lookbehind ensuring what precedes matches the following
    " : Match this literally
    [^"\n]* : Match anything except " or \n any number of times
a : Match this literally
(?=[^"\n]*") : Positive lookahead ensuring what follows matches the following
    [^"\n]* : Match anything except " or \n any number of times
    " : Match this literally

You can drop the \n from the above pattern, as the following suggests. I added it just in case there are special cases I'm not considering (e.g. comments) that could break this regex within your text. The \A also forces the regex to match from the start of the string (or file) instead of the start of the line.
(?<!\A[^"]*(?:(?:"(?:[^"\\]|\\.)*){2})+)(?<="[^"]*)a(?=[^"]*")
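The practical difference between anchoring at the start of the line and at the start of the file shows up when a quoted string spans a line break. Here is a simplified Python sketch of that difference. It emulates only the quote counting, deliberately ignores backslash escapes, and the function name and per_line flag are hypothetical.

```python
from typing import List


def quoted_matches(text: str, target: str = "a", per_line: bool = True) -> List[int]:
    hits = []
    inside = False  # True while an odd number of quotes precedes the position
    for i, ch in enumerate(text):
        if ch == "\n" and per_line:
            inside = False  # line-anchored variant: the count restarts per line
        elif ch == '"':
            inside = not inside
        elif ch == target and inside:
            rest = text[i + 1:]
            if per_line:
                rest = rest.split("\n", 1)[0]  # closing quote must be on this line
            if '"' in rest:
                hits.append(i)
    return hits


text = 'x = "b\na" a'
print(quoted_matches(text, per_line=True))   # []
print(quoted_matches(text, per_line=False))  # [7]
```

With the per-line rule the string that starts on the first line is never seen as closed, so nothing matches. Counting from the start of the whole text finds the a at index 7 inside the multi-line string.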
You can test this regex here
This is what it looks like in Visual Studio: