I am trying to find all VBA comments using regular expressions. I have something that mostly works, but there are a few exceptions that I cannot figure out.
Expression I
Maybe something like
^(?:[^"'\n]*("(?:[^"\n]|"")*"))*[^"]*'(.*)$
It handles multiple quoted strings, as well as strings containing escaped (doubled) "s (which I believe is VBA's way).
(I guarantee it will fail in some cases, but probably will work in most ;)
Check it out here at regex101.
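For anyone who would rather try the pattern outside regex101, here is a minimal Python sketch. The pattern is copied unchanged from above; the helper name `comment_text` and the sample lines are made up for illustration, and each line is matched on its own.

```python
import re

# First pattern from this answer, copied unchanged.
# Group 1 is the last string literal seen, group 2 is the comment text.
PATTERN = re.compile(r'''^(?:[^"'\n]*("(?:[^"\n]|"")*"))*[^"]*'(.*)$''')


def comment_text(line: str):
    """Return the text after the comment marker, or None if nothing matched.

    Hypothetical helper for illustration only; it expects a single line.
    """
    match = PATTERN.match(line)
    return match.group(2) if match else None


samples = [
    'Debug.Print "Hello" \'greeting',
    'Debug.Print "Say ""hi""" \'quoted',
    'Debug.Print "It\'s fine"',
    "x = 1 'don't",
]
for sample in samples:
    print(repr(sample), "->", repr(comment_text(sample)))
```

The first two samples give `greeting` and `quoted`, and the third (an apostrophe inside a string, no comment) gives `None`. The last one gives only `t`, because the greedy `[^"]*` backtracks to the last apostrophe on the line, so comments that contain an apostrophe are one of the cases where it falls short.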
Edit
Added some of Comintern's examples and adjusted the regex. It still can't handle bracketed identifiers, though (I don't even know what those are :S; see the last line). But it now handles his continued line comments.
^(?:[^"'\n]*(?:"(?:[^"\n]|"")*"))*[^']*('(?:_\n|.)*)
Check it out here at regex101.
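The same kind of quick check for the adjusted pattern, again as a rough Python sketch. The pattern is copied unchanged; `first_comment` is a hypothetical helper, and `re.MULTILINE` is my assumption about the flags used on regex101.

```python
import re

# Adjusted pattern from the edit above, copied unchanged.
# Group 1 is the comment, including the leading apostrophe.
PATTERN = re.compile(
    r'''^(?:[^"'\n]*(?:"(?:[^"\n]|"")*"))*[^']*('(?:_\n|.)*)''',
    re.MULTILINE,
)


def first_comment(source: str):
    """Return the first comment the pattern finds in source, or None."""
    match = PATTERN.search(source)
    return match.group(1) if match else None


continued = (
    "Debug.Print 'End of line comment _\n"
    "with line continuation.\n"
    "Debug.Print 42"
)
bracketed = 'Debug.Print [Evil:""Comment"\'here]'

print(repr(first_comment(continued)))
print(repr(first_comment(bracketed)))
```

The continued comment comes back as one match spanning both physical lines. The bracketed identifier line comes back as `'here]`, which is the false positive mentioned above.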
This should work:
("[^"]+"\s)?'.+
Tested here: https://regex101.com/r/dd60QS/1
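A small Python sketch for trying this pattern on a few lines. The pattern is copied unchanged; the helper name `matched_text` and the sample lines are made up for illustration.

```python
import re

# Pattern from this answer, copied unchanged.
PATTERN = re.compile(r'''("[^"]+"\s)?'.+''')


def matched_text(line: str):
    """Return the whole match for one line, or None if there is no match."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


samples = [
    "Debug.Print 42 'answer",
    'Debug.Print "x" \'c',
    'Debug.Print "It\'s fine"',
]
for sample in samples:
    print(repr(sample), "->", repr(matched_text(sample)))
```

The plain end-of-line comment is found as expected. In the second sample the match also includes the preceding string literal, and in the third the apostrophe inside the string is reported as the start of a comment, so the result probably needs post-processing before it can be trusted.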
You can't find all of the comments (let alone string literals) in VBA code with regular expressions - period. Trust me, I tried during work on the Smart Indenter module of Rubberduck (in case that wasn't explicit enough - full disclosure, I'm a contributor). You'll need to actually parse the code. The first issue that you'll run into is line continuations:
'Comment with a line _
continuation
Debug.Print 'End of line comment _
with line continuation.
Debug.Print 'Multiple line continuation operators _ _
still work.
Debug.Print 'This is actually *not* a line continuation_
Debug.Print 42
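One way to take line continuations out of the picture before looking for comments is to join physical lines into logical lines first. The following is a minimal Python sketch of that idea, not how Rubberduck does it. It assumes a line continues only when it ends with whitespace followed by an underscore, which is what the examples above suggest; `join_continuations` and `SAMPLE` are names invented for illustration.

```python
def join_continuations(source: str) -> list[str]:
    """Group physical lines into logical lines.

    Assumption: a physical line is continued only when it ends with a
    space or tab followed by "_". The original text is kept intact; the
    physical lines of one logical line are joined with a newline.
    """
    logical_lines = []
    pending = []
    for line in source.splitlines():
        pending.append(line)
        if line.endswith((" _", "\t_")):
            continue  # the next physical line belongs to this one
        logical_lines.append("\n".join(pending))
        pending = []
    if pending:  # source ended in the middle of a continuation
        logical_lines.append("\n".join(pending))
    return logical_lines


SAMPLE = "\n".join([
    "'Comment with a line _",
    "continuation",
    "Debug.Print 'End of line comment _",
    "with line continuation.",
    "Debug.Print 'Multiple line continuation operators _ _",
    "still work.",
    "Debug.Print 'This is actually *not* a line continuation_",
    "Debug.Print 42",
])

for logical in join_continuations(SAMPLE):
    print(repr(logical))
```

With the eight physical lines above this yields five logical lines: three continued comments, then the `continuation_` line and `Debug.Print 42` as separate lines, because an underscore without preceding whitespace does not continue anything.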
This makes it difficult to identify string literals, especially if you're using line-by-line processing:
Debug.Print 42 'The next line... _
"...is not a string literal"
You also have to handle the old Rem comment syntax...
Rem old school comment
...which also supports line continuations:
Rem old school comment with line _
continuation.
You might be thinking "that can't be so bad, Rem has to start a line". If you are, you forgot about the statement separator (:)...
Debug.Print 42: Rem statement separator comment.
...or its evil twin the statement separator combined with a line continuation:
Debug.Print 42: Rem this can be _
continued too.
You covered a couple of the issues with sorting out string literals and comments like these...
Debug.Print "Unmatched double quotes." 'Comment"
Debug.Print "Interleaved single 'n double quotes." 'Comment"
...but what about bracketed identifiers like this beast (courtesy of @ThunderFrame)?
'No comments or strings in the line below.
Debug.Print [Evil:""Comment"'here]
Note that the syntax highlighter SO uses doesn't even catch all of these bizarre corner cases.
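To give a feel for what "actually parse the code" means for the cases above, here is a deliberately small character-by-character scanner in Python. It is only a sketch under stated assumptions, not Rubberduck's parser: `find_comment` is a hypothetical helper, it expects one logical line (continued physical lines already joined, newline included), and it ignores line numbers, date literals and everything else not shown in this answer.

```python
import re

# "Rem" only counts as a comment keyword when it is a whole word.
REM_KEYWORD = re.compile(r"rem(?=\s|$)", re.IGNORECASE)


def find_comment(logical_line: str):
    """Return the comment in one logical line of VBA, or None."""
    i, n = 0, len(logical_line)
    at_statement_start = True  # true at line start and after ":"
    while i < n:
        ch = logical_line[i]
        if ch == "'":
            return logical_line[i:]
        if at_statement_start and REM_KEYWORD.match(logical_line, i):
            return logical_line[i:]
        if ch == '"':
            # Skip a string literal; "" inside it is an escaped quote.
            i += 1
            while i < n:
                if logical_line[i] == '"':
                    if logical_line[i + 1:i + 2] == '"':
                        i += 2
                        continue
                    break
                i += 1
            i += 1  # step past the closing quote
            at_statement_start = False
        elif ch == "[":
            # Skip a bracketed identifier; nothing inside it is special.
            close = logical_line.find("]", i)
            i = n if close == -1 else close + 1
            at_statement_start = False
        elif ch == ":":
            at_statement_start = True
            i += 1
        elif ch.isspace():
            i += 1
        else:
            at_statement_start = False
            i += 1
    return None


samples = [
    'Debug.Print [Evil:""Comment"\'here]',
    'Debug.Print "Interleaved single \'n double quotes." \'Comment"',
    "Debug.Print 42: Rem statement separator comment.",
    'Debug.Print 42 \'The next line... _\n"...is not a string literal"',
]
for sample in samples:
    print(repr(find_comment(sample)))
```

The point of the design is that the scanner always knows what state it is in (string, bracketed identifier, statement start), which is exactly the context a single regular expression struggles to track.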