问题
I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 letters like this "hey I'm a string
.
I'm currently working with this one here \"([^\\\"]+|\\.)+\"
but it fails on one of my test cases.
Test cases:
std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
Expected output:
std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");
On the second one I instead get
std::cout << String("He said: \")bananas\"String(" << ")...";
Short repro code (using the regex by AR.3):
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");
回答1:
Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex
. Let's use PCRE (or boost::regex
) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.
We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:
- Singleline comments
- Multiline comments
- Character literals
- String literals
We'll be using the extended (x
) option, which ignores whitespace in the pattern.
Comments
Here's what [lex.comment]
says:
The characters
/*
start a comment, which terminates with the characters*/
. These comments do not nest. The characters//
start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters//
,/*
, and*/
have no special meaning within a//
comment and are treated just like other characters. Similarly, the comment characters//
and/*
have no special meaning within a/*
comment. — end note ]
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
Easy peasy. If you match anything there, just (*SKIP)(*FAIL)
- meaning that you throw away the match. The (?s: .*? )
applies the s
(singleline) modifier to the .
metacharacter, meaning it's allowed to match newlines.
Character literals
Here's the grammar from [lex.ccon]
:
character-literal: encoding-prefix(opt) ’ c-char-sequence ’ encoding-prefix: one of u8 u U L c-char-sequence: c-char c-char-sequence c-char c-char: any member of the source character set except the single-quote ’, backslash \, or new-line character escape-sequence universal-character-name escape-sequence: simple-escape-sequence octal-escape-sequence hexadecimal-escape-sequence simple-escape-sequence: one of \’ \" \? \\ \a \b \f \n \r \t \v octal-escape-sequence: \ octal-digit \ octal-digit octal-digit \ octal-digit octal-digit octal-digit hexadecimal-escape-sequence: \x hexadecimal-digit hexadecimal-escape-sequence hexadecimal-digit
Let's define a few things first, which we'll need later on:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
prefix
is defined as an optionalu8
,u
,U
orL
escape
is defined as per the standard, except that I've mergeduniversal-character-name
into it for the sake of simplicity
Once we have these, a character literal is pretty simple:
(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
We throw it away with (*SKIP)(*FAIL)
Simple strings
They're defined in almost the same way as character literals. Here's a part of [lex.string]
:
string-literal: encoding-prefix(opt) " s-char-sequence(opt) " encoding-prefix(opt) R raw-string s-char-sequence: s-char s-char-sequence s-char s-char: any member of the source character set except the double-quote ", backslash \, or new-line character escape-sequence universal-character-name
This will mirror the character literals:
(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
The differences are:
- The character sequence is optional this time (
*
instead of+
) - The double quote is disallowed when unescaped instead of the single quote
- We actually don't throw it away :)
Raw strings
Here's the raw string part:
raw-string: " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) " r-char-sequence: r-char r-char-sequence r-char r-char: any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ". d-char-sequence: d-char d-char-sequence d-char d-char: any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.
The regex for this is:
(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
[^ ()\\\t\x0B\r\n]*
is the set of characters that are allowed in delimiters (d-char
)\k<delimiter>
refers to the previously matched delimiter
The full pattern
The full pattern is:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
See the demo here.
boost::regex
Here's a simple demo program using boost::regex:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
static void test()
{
boost::regex re(R"regex(
(?(DEFINE)
(?<prefix> (?:u8?|U|L) )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
)regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);
std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
)subject");
std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}
int main(int argc, char **argv)
{
try
{
test();
}
catch(std::exception ex)
{
std::cerr << ex.what() << std::endl;
}
return 0;
}
(I left syntax highlighting disabled because it goes nuts on this code)
For some reason, I had to take the ?
quantifier out of prefix
(change (?<prefix> (?:u8?|U|L)? )
to (?<prefix> (?:u8?|U|L) )
and (?&prefix)
to (?&prefix)?
) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.
What if we don't have a fancy regex engine at hand?
Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>
...)
with normal groups (?:
...)
if we don't nest quantifiers in order to avoid catastrophic backtracking.
We can also avoid (*SKIP)(*FAIL)
if we add one line of logic into the replacement function: All the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:
(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
And here's an interactive demo you can play with:
function run() {
var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
var input = document.getElementById("input").value;
var output = input.replace(re, function(m, ignore) {
return ignore ? m : "String(" + m + ")";
});
document.getElementById("output").innerText = output;
}
document.getElementById("input").addEventListener("input", run);
run();
<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>
回答2:
Read the relevant sections from the C++ standard, they are called lex.ccon
and lex.string
.
Then convert each rule you find there into a regular expression (if you really want to use regular expressions; it might turn out that they are not capable of doing this job).
Then, build more complicated regular expressions out of them. Be sure to name your regular expressions exactly as the rules from the C++ standard, so that you can recheck them later.
If, instead of using regular expressions, you want to use an existing tool, here is one: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html. Have a look at the LexStringLiteral
function.
回答3:
Regular expressions can be tricky for beginners but once you understand it's basics and well tested divide and conquer strategy, it will be your goto tool.
What you need to search for quote (") not starting with () back slash and read all characters upto next quote.
The regex I came up is (".*?[^\\]")
. See a code snippet below.
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex re(R"((".*?[^\\]"))");
in_line = std::regex_replace(in_line, re, "String($1)");
std::cout << in_line << endl;
Output:
std::cout << String("He said: \"bananas\"") << String("...");
Regex Explanation:
(".*?[^\\]")
Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only
- Match the regex below and capture its match into backreference number 1
(".*?[^\\]")
- Match the character “"” literally
"
- Match any single character that is NOT a line break character (line feed, carriage return)
.*?
- Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
*?
- Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
- Match any character that is NOT the backslash character
[^\\]
- Match the character “"” literally
"
- Match the character “"” literally
String($1)
- Insert the character string “String” literally
String
- Insert an opening parenthesis
(
- Insert the text that was last matched by capturing group number 1
$1
- Insert a closing parenthesis
)
来源:https://stackoverflow.com/questions/41909225/regex-for-matching-c-string-constant