I am making a key-value parser where the input string takes the form of key:"value",key2:"value". Keys can contain the characters a-z, A-Z
You could use either of the regexes below to get the key-value pairs.
([a-zA-Z0-9]+):"(.*?)(?<!\\)"
OR
([a-zA-Z0-9]+):"(.*?)"(?=,[a-zA-Z0-9]+:"|$)
The Java regex would be:
"([a-zA-Z0-9]+):\"(.*?)(?<!\\\\)\""
The negative lookbehind (?<!\\) asserts that the closing double quote is not preceded by a backslash character. In Java, matching a single literal backslash takes four backslashes in the string literal, i.e. \\\\: the compiler reduces them to two, and the regex engine reads those two as one escaped backslash.
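To make the two levels of escaping concrete, here is a minimal sketch. The class and method names (LookbehindDemo, firstUnescapedQuote) are made up for illustration; only the lookbehind itself comes from the regex above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // The Java literal "(?<!\\\\)\"" is the regex (?<!\\)" : a double quote
    // that is not immediately preceded by a backslash.
    private static final Pattern UNESCAPED_QUOTE = Pattern.compile("(?<!\\\\)\"");

    // Hypothetical helper: index of the first unescaped double quote, or -1 if none.
    static int firstUnescapedQuote(String s) {
        Matcher m = UNESCAPED_QUOTE.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The text is a\"b" : the quote at index 2 is escaped, the one at index 4 is not.
        System.out.println(firstUnescapedQuote("a\\\"b\""));
        // Four backslashes in the literal match exactly one backslash character.
        System.out.println(Pattern.matches("\\\\", "\\"));
    }
}
```

Running main should print 4 and then true.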
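import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "joe:\"Look over there\\, it's a shark!\",sam:\"I like fish.\"";
        // Group 2 lazily extends to the first double quote not preceded by a backslash
        Matcher m = Pattern.compile("([a-zA-Z0-9]+):\"(.*?)(?<!\\\\)\"").matcher(s);
        while (m.find()) {
            System.out.println(m.group(1) + " --> " + m.group(2));
        }
    }
}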
Output:
joe --> Look over there\, it's a shark!
sam --> I like fish.
OR
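import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "joe:\"Look over there\\, i\\\"t's a shark!\",sam:\"I like fish.\"";
        // Group 2 is a run of escaped quotes (\") or any characters other than a double quote
        Matcher m = Pattern.compile("([a-zA-Z0-9]+):\"((?:\\\\\"|[^\"])*)\"").matcher(s);
        while (m.find()) {
            System.out.println(m.group(1) + " --> " + m.group(2));
        }
    }
}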
Output:
joe --> Look over there\, i\"t's a shark!
sam --> I like fish.
Assuming that \ followed by any character except a line terminator specifies the character immediately following it, you can use the following regex to match all instances of key-value pairs:
"([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""
Add \\s* before and after : if you want to allow free spacing.
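As a rough sketch of the free-spacing variant just described, the pattern below adds \\s* on both sides of the colon. The class name SpacedPairs and the parse method are hypothetical, and the values are returned raw, with escape sequences left in place.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpacedPairs {
    // The key-value regex with \\s* added before and after the colon.
    private static final Pattern PAIR =
            Pattern.compile("([a-zA-Z0-9]+)\\s*:\\s*\"((?:[^\\\\\"]|\\\\.)*+)\"");

    // Hypothetical helper: collects key -> raw value in order of appearance.
    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("joe : \"hi\", sam:\"I like fish.\""));
    }
}
```

With the sample input in main, this should print {joe=hi, sam=I like fish.}.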
This is what the regex engine sees:
([a-zA-Z0-9]+):"((?:[^\\"]|\\.)*+)"
The quantifier * is made possessive (*+), since the two branches [^\\"] and \\. are mutually exclusive (no string can be matched by both at the same time), so backtracking into the repetition could never produce a different match. It also avoids a StackOverflowError in Oracle's implementation of the Pattern class.
Use the regex above in a Matcher loop:
Pattern keyValuePattern = Pattern.compile("([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\"");
Matcher matcher = keyValuePattern.matcher(inputString);
while (matcher.find()) {
    String key = matcher.group(1);
    // Process the escape sequences in the value string
    String value = matcher.group(2).replaceAll("\\\\(.)", "$1");
    // ...
}
In the general case, depending on the complexity of the escape sequences (e.g. \n, \uhhhh, \xhh, \0), you might want to write a separate function to parse them. However, with the assumption above, the one-liner suffices.
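If richer escapes are needed, such a separate function could look roughly like the sketch below. The set of escapes it handles (\n, \t, \uhhhh, with every other escaped character mapping to itself) is an assumption made for illustration, as are the names EscapeDecoder, unescape and hex4.

```java
class EscapeDecoder {
    // Hypothetical helper: decodes an assumed, small set of escape sequences.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            if (c != '\\' || i == raw.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                continue;
            }
            char e = raw.charAt(i++);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'u': {
                    int code = hex4(raw, i);
                    if (code >= 0) { out.append((char) code); i += 4; }
                    else out.append('u'); // not four hex digits: keep the 'u' itself
                    break;
                }
                default: out.append(e); // \" -> ", \\ -> \, \, -> , and so on
            }
        }
        return out.toString();
    }

    // Parses four hex digits starting at 'from'; returns -1 if they are not all present.
    static int hex4(String s, int from) {
        if (from + 4 > s.length()) return -1;
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) return -1;
            value = value * 16 + d;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(unescape("Look over there\\, i\\\"t's a \\u0041 shark!"));
    }
}
```

In the Matcher loop, matcher.group(2) would then be passed to unescape instead of going through replaceAll.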
Note that this solution doesn't care about the separators, though, and on invalid input it will skip ahead to the nearest match. In the example of invalid input below, the solution above skips abc:" at the beginning and happily matches xyz:"text text" and more:"pair" as key-value pairs:
abc:"xyz:"text text", more:"pair"
If this behavior is not desirable, there is a solution, but the string containing all the key-value pairs must be isolated first, instead of being part of a bigger string that doesn't have anything to do with key-value pairs:
"(?:^|(?!^)\\G,)([a-zA-Z0-9]+):\"((?:[^\\\\\"]|\\\\.)*+)\""
Free-spacing version:
"(?:^\\s*|(?!^)\\G\\s*,\\s*)([a-zA-Z0-9]+)\\s*:\\s*\"((?:[^\\\\\"]|\\\\.)*+)\""