Question
I have strings entered by the user and want to tokenize them with a regex, but I am stuck on a special case. An example string is
Test + "Hello" + "Good\"more" + "Escape\"This\"Test"
or the C# equivalent
@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I am able to match the Test and + tokens, but not the ones enclosed in ". I use the " so the user can mark text as a literal string rather than a special token. If the user wants a " character inside the string, I thought of letting them escape it with a \.
So the rule would be: give me everything between two " characters, where the closing " must not be preceded by a \.
The results I expect are:
"Hello"
"Good\"more"
"Escape\"This\"Test"
I need the " characters to be part of the final match so I know that the token is a string.
I currently have the regex @"""([\w]*)(?<!\\"")""", which gives me the following results:
"Hello"
"more"
"Test"
So the lookbehind isn't working the way I want it to. Does anyone know the correct way to match the strings as described?
Answer 1:
Here's an adaptation of a regex I use to parse command lines:
(?!\+)((?:"(?:\\"|[^"])*"?|\S)+)
Example here at regex101 (the adaptation is the negative lookahead to ignore + and checking for \" instead of "").
Hope this helps you.
Regards.
Edit:
If you aren't interested in the surrounding quotes, this variant captures the quoted content in group 1 and unquoted tokens in group 2:
(?!\+)(?:"((?:\\"|[^"])*)"?|(\S+))
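As a rough illustration of how the first pattern behaves on the string from the question, here is a minimal C# sketch. The class name QuotedTokenDemo and the Tokenize helper are made up for this example; only the regex itself comes from the answer. Note that the + separators are skipped rather than returned as tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedTokenDemo
{
    // Pattern from the answer: a token is a run of quoted chunks (inside which
    // \" does not end the chunk) and/or non-whitespace characters, and it may
    // not start with '+'.
    private static readonly Regex TokenRx =
        new Regex(@"(?!\+)((?:""(?:\\""|[^""])*""?|\S)+)");

    // Hypothetical helper (not part of the answer): returns each match as a string.
    public static List<string> Tokenize(string input)
    {
        return TokenRx.Matches(input)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToList();
    }

    public static void Main()
    {
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        foreach (var token in Tokenize(s))
            Console.WriteLine(token);
        // Prints:
        // Test
        // "Hello"
        // "Good\"more"
        // "Escape\"This\"Test"
    }
}
```

With the second pattern, the quoted content (backslashes included) would be read from m.Groups[1] and bare tokens such as Test from m.Groups[2].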
Answer 2:
To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:
^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches:
^ - start of string (so that we can check each " and escape sequence)
(?: - non-capturing group 1, serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)* - 0+ characters other than " and \, followed by 0+ sequences of \\. (any escape sequence), each followed by 0+ characters other than " and \ (thus, we avoid matching the first " that is escaped, and it can be preceded by any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*") - capture group 1, matching "..." substrings that may contain any escape sequences inside
)+ - end of the first non-capturing group, which is repeated 1 or more times
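The breakdown above can also be kept next to the pattern itself. The following is only a sketch of that idea: it writes the same regex with RegexOptions.IgnorePatternWhitespace so that each piece carries a comment. The class name VerbosePatternDemo and the QuotedLiterals helper are hypothetical names chosen for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VerbosePatternDemo
{
    // Same pattern as above, spread over several lines. With
    // RegexOptions.IgnorePatternWhitespace, unescaped whitespace outside
    // character classes is ignored and # starts a comment.
    private const string Pattern = @"
        ^                                  # start of string
        (?:                                # container group, repeated by the final +
            [^""\\]* (?: \\. [^""\\]* )*   # text outside quotes, escape sequences allowed
            (                              # capture group 1: one quoted literal
                "" [^""\\]* (?: \\. [^""\\]* )* ""
            )
        )+";

    // Hypothetical helper: returns every capture made by group 1, in order.
    public static string[] QuotedLiterals(string input)
    {
        var m = Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace);
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
        Console.WriteLine(string.Join("\n", QuotedLiterals(s)));
        // Prints:
        // "Hello"
        // "Good\"more"
        // "f"
    }
}
```

The part between the escaped quotes (\"Escape\"This\"Test\") is not reported because each \" there is consumed as an escape sequence outside a literal.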
See the regex demo and here is a C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
// Group 1 is repeated, so read every capture it made, not just the last one.
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value))
    .ToList();
Console.WriteLine(string.Join("\n", matches));
UPDATE
If you need to remove the quoted tokens instead, match and capture everything outside of them with this code:
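using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sub-pattern for text that contains no unescaped quote.
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
// The named group "keep" captures the text before each quoted literal and after the last one.
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value))
    .ToList();
Console.WriteLine(string.Join("", matches));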
See another demo.
For the input @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""" the output is:
Test +  +  + \"Escape\"This\"Test\" + 
Source: https://stackoverflow.com/questions/35790482/regex-tokenize-issue