Problem:
I have thousands of documents which contain a specific character I don't want, e.g. the character `a`. These documents contain a variety of characters, but I only want to replace the ones inside quoted strings.
VS Code uses the JavaScript regex engine for its find / replace functionality. This means you are quite limited in what you can do with regex compared to other flavors such as .NET or PCRE.
Luckily, this flavor supports lookaheads, and with a lookahead you can look for characters without consuming them. So one way to ensure that we are inside a quoted string is to require that the number of quotes between the matched `a` and the end of the file / subject string is odd:
a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
This looks for `a`s in a double-quoted string. To make it work for single-quoted strings, substitute every `"` with `'`. You can't have both at the same time.
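To try the pattern outside the editor, here is a minimal Python sketch. It assumes Python's `re` module, which accepts the same lookahead syntax; the sample text and the `X` replacement are made up for illustration.

```python
import re

# Same pattern as above: an `a` followed by an odd number of `"`
# between it and the end of the subject string.
IN_DOUBLE_QUOTES = re.compile(r'a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)')

sample = 'a "banana" and a "cat"'  # illustrative input, not from the question
result = IN_DOUBLE_QUOTES.sub('X', sample)
print(result)
```

Only the `a`s between the quotes are replaced. Note that every candidate `a` triggers a scan to the end of the text, which is why this gets slow on large inputs.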
There is a problem with the regex above, however: it is thrown off by escaped double quotes within double-quoted strings. If that matters to you, handling them as well takes a much longer pattern:
a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)
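The difference between the two patterns shows up as soon as a string contains a lone escaped quote. The following Python sketch is only an illustration: it assumes the `re` module, builds the long pattern from a helper fragment `U` (a name introduced here, not part of the original pattern), and uses a made-up sample.

```python
import re

# The short pattern: counts every `"`, escaped or not.
SIMPLE = re.compile(r'a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)')

# U (hypothetical helper name) is the repeated fragment of the long pattern:
# a run of text containing no unescaped `"`.
U = r'[^"\\]*(?:\\.[^"\\]*)*'
ESCAPE_AWARE = re.compile('a(?=' + U + '"' + U + '(?:"' + U + '"' + U + ')*$)')

sample = r'"a \" a" a'  # one string with an escaped quote, then a bare `a`

simple_result = SIMPLE.sub('X', sample)
aware_result = ESCAPE_AWARE.sub('X', sample)
print(simple_result)
print(aware_result)
```

The short pattern misses the first `a` because it counts the escaped quote as a delimiter; the escape-aware one replaces both `a`s inside the string and leaves the trailing one alone.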
Applying these approaches to large files will probably result in a stack overflow, so let's see a better approach.
I am using VSCode, but I'm open to any suggestions.
That's great. Then I'd suggest using `awk` or `sed` or something more programmatic to achieve what you are after. Alternatively, if you are able to use Sublime Text, there is a chance to work around this problem in a more elegant way.
This is supposed to work on large files with hundreds of thousands of lines. Note that it works for a single character (here `a`); with some modifications it may work for a word or substring too:
Search for:
(?:"|\G(?<!")(?!\A))(?<r>[^a"\\]*+(?>\\.[^a"\\]*)*+)\K(a|"(*SKIP)(*F))(?(?=((?&r)"))\3)
(The three literal `a`s in the pattern, two inside the character classes and one in the alternation, are the character being searched for.)
Replace it with: WHATEVER\3
RegEx Breakdown:
(?:                       # Beginning of non-capturing group #1
    "                     # Match a `"`
    |                     # Or
    \G(?<!")(?!\A)        # Continue from the end of the previous match, unless it
                          # ended with a `"` or we are at the start of the subject
)                         # End of NCG #1
(?<r>                     # Start of capturing group `r`
    [^a"\\]*+             # Match anything except `a`, `"` or a backslash (possessively)
    (?>\\.[^a"\\]*)*+     # Match an escaped character followed by the previous
                          # pattern, as many times as possible
)\K                       # End of CG `r`, reset all consumed characters
(                         # Start of CG #2
    a                     # Match literal `a`
    |                     # Or
    "(*SKIP)(*F)          # Match a `"` and skip over current match
)
(?(?=                     # Start a conditional cluster, assuming a positive lookahead
    ((?&r)")              # Start of CG #3, recurse into CG `r` and match `"`
  )                       # End of condition
    \3                    # If conditional passed match CG #3
)                         # End of conditional
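If you take the "something more programmatic" route instead, the same left-to-right idea can be written as a small single-pass scanner. The Python sketch below is not a translation of the regex; it is a hand-written state machine under a few assumptions: only double quotes delimit strings, a backslash inside a string escapes the next character (which is then copied verbatim), and the function name `replace_in_quotes` is made up for this example.

```python
def replace_in_quotes(text: str, target: str, replacement: str) -> str:
    """Replace `target` (a single character) inside double-quoted strings."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string and ch == '\\' and i + 1 < len(text):
            # Escaped pair inside a string: copy both characters untouched.
            out.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            # An unescaped quote opens or closes a string.
            in_string = not in_string
            out.append(ch)
        elif in_string and ch == target:
            out.append(replacement)
        else:
            out.append(ch)
        i += 1
    return ''.join(out)


sample = r'a "banana \" a" a'  # illustrative input
print(replace_in_quotes(sample, 'a', 'X'))
```

Each character is visited once, so the running time grows linearly with the file size and there is no backtracking to overflow a stack.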
Last but not least...
Matching a character inside quotation marks is tricky because the delimiters are identical, so opening and closing marks cannot be distinguished from each other without looking at the adjacent text. What you can do is change one delimiter to something else so that you can look for it later.
Search for: "[^"\\]*(?:\\.[^"\\]*)*"
Replace with: $0Я
Search for: a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)
Replace with whatever you expect.
Search for: "Я
Replace with: " (this removes the markers and reverts everything else).
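The three passes can be rehearsed in Python before running them across thousands of files. This is a rough sketch assuming the `re` module and that the marker `Я` does not already occur in the text; the function name `replace_via_marker` and the sample are hypothetical.

```python
import re

MARK = 'Я'  # temporary marker; assumed not to occur in the input
QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')


def replace_via_marker(text: str, target: str, replacement: str) -> str:
    if MARK in text:
        raise ValueError('marker already present in text; pick another one')
    # Pass 1: append the marker after every complete quoted string.
    marked = QUOTED.sub(lambda m: m.group(0) + MARK, text)
    # Pass 2: replace the target only when a marked closing quote follows
    # with no unescaped quote in between.
    inside = re.compile(
        re.escape(target) + r'(?=[^"\\]*(?:\\.[^"\\]*)*"' + MARK + ')'
    )
    replaced = inside.sub(lambda m: replacement, marked)
    # Pass 3: drop the marker, keeping the closing quote.
    return replaced.replace('"' + MARK, '"')


sample = r'a "banana \" a" a "b"'  # illustrative input
print(replace_via_marker(sample, 'a', 'X'))
```

Because each lookahead stops at the nearest closing quote instead of running to the end of the file, the second pass stays cheap even on long inputs.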