I'm implementing a parser, and I need to locate and deserialize JSON objects embedded in other semi-structured data. I used a regexp:
As others have suggested, a full-blown JSON parser is probably the way to go. If you want to match the key-value pairs in the simple examples that you have above, you could use:
(?<=\{)\s*[^{]*?(?=[\},])
For the input string
{title:'Title', {data:'Data', {foo: 'Bar'}}}
This matches:
1. title:'Title'
2. data:'Data'
3. foo: 'Bar'
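If you want to try this out quickly, here is a minimal sketch in Python (my choice of language, the question doesn't name one). The pattern and the input string are copied from above; the names PAIR_PATTERN, sample and pairs are just made up for the illustration.

```python
import re

# Pattern copied from the answer: start right after a "{", lazily consume
# characters that are not "{", and stop just before the next "}" or ",".
PAIR_PATTERN = re.compile(r"(?<=\{)\s*[^{]*?(?=[\},])")

# Input string from the answer (JavaScript-style literal, not strict JSON).
sample = "{title:'Title', {data:'Data', {foo: 'Bar'}}}"

pairs = PAIR_PATTERN.findall(sample)

for number, pair in enumerate(pairs, start=1):
    print(f"{number}. {pair}")
```

Note that the lookbehind only fires directly after a {, so in an object such as {a:1, b:2} only the first pair would be picked up, and a value that itself contains a comma or brace would be cut short. That is part of why a real parser is the safer route.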
Thanks to @Sanjay T. Sharma, who pointed me to "brace matching", because I eventually gained some understanding of greedy expressions, and thanks also to the others for telling me up front what I shouldn't do. Fortunately, it turned out to be OK to use the greedy variant of the expression
\{\s*title.*\}
because there is no non-JSON data between closing brackets.
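For anyone wondering what "greedy" buys you here, this is a rough Python sketch, assuming the braces are escaped with a single backslash and the data sits on one line (no re.DOTALL). The sample strings and the names GREEDY_OBJECT, text and noisy are invented for the illustration.

```python
import re

# Greedy variant: ".*" runs as far as it can, so the match ends at the
# LAST "}" on the line rather than at the first one.
GREEDY_OBJECT = re.compile(r"\{\s*title.*\}")

# Hypothetical input: one object followed by data that contains no "}".
text = "var item = {title:'Title', {data:'Data', {foo: 'Bar'}}}; // trailing code"
match = GREEDY_OBJECT.search(text)
extracted = match.group(0) if match else None
print(extracted)

# Counter-example: a second, unrelated "}" later on is swallowed as well,
# which is why this only works when nothing follows the closing braces.
noisy = "{title:'A'} some text {other:'B'}"
overshoot = GREEDY_OBJECT.search(noisy).group(0)
print(overshoot)
```

The second case shows the condition I rely on: as soon as another closing brace appears after the object, the greedy match takes it too.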
This recursive Perl/PCRE regular expression should be able to match any valid JSON or JSON5 object, including nested objects and edge cases such as braces inside JSON strings or JSON5 comments:
/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/
Of course, that's a bit hard to read, so you might prefer the commented version:
m{
( # Begin capture group (matching a JSON object).
\{ # Match opening brace for JSON object.
(?: # Begin non-capturing group to contain alternations.
(?>[^{}"'\/]+) # Match a non-empty string which contains no braces, quotes or slashes, without backtracking.
| # Alternation; next alternative follows.
(?>"(?:(?>[^\\"]+)|\\.)*") # Match a double-quoted JSON string, without backtracking.
| # Alternation; next alternative follows.
(?>'(?:(?>[^\\']+)|\\.)*') # Match a single-quoted JSON5 string, without backtracking.
| # Alternation; next alternative follows.
(?>\/\/.*\n) # Match a single-line JSON5 comment, without backtracking.
| # Alternation; next alternative follows.
(?>\/\*.*?\*\/) # Match a multi-line JSON5 comment, without backtracking.
| # Alternation; next alternative follows.
(?-1) # Recurse to most recent capture group, to match a nested JSON object.
)* # End of non-capturing group; match zero or more repetitions of this group.
\} # Match closing brace for JSON object.
) # End of capture group (matching a JSON object).
}x
This is absolutely horrible and I can't believe I'm actually putting my name to this solution, but could you not locate the first { character that is in a JavaScript block and attempt to parse the remaining characters through a proper JSON parsing library? If it works, you've got a match. If it doesn't, keep reading until the next { character and start over.
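A minimal sketch of that loop, assuming Python and its standard json module (the question doesn't say which language or library is in use). JSONDecoder.raw_decode parses a value starting at a given index and ignores whatever follows it, which fits "parse the remaining characters". The function name find_first_json_object and the sample text are made up.

```python
import json


def find_first_json_object(text):
    """Return (obj, start, end) for the first "{" that parses as JSON, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # returns it together with the index just past its end.
            obj, end = decoder.raw_decode(text, start)
            return obj, start, end
        except json.JSONDecodeError:
            # Not JSON from here: move on to the next "{" and start over.
            start = text.find("{", start + 1)
    return None


# Hypothetical input: the first brace block is not JSON, the second one is.
sample = 'var config = {not json}; var data = {"title": "Title", "nested": {"foo": "Bar"}}; done()'
result = find_first_json_object(sample)
if result is not None:
    obj, start, end = result
    print(obj, sample[start:end])
```

This only accepts strict JSON (double-quoted keys and strings), so JavaScript-style literals with bare keys would be skipped.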
There are a few issues there, but they can probably be worked around:
<script>...</script> blocks.
An improvement would be, once you've found the first {, to look for the matching } (a simple counter that is incremented whenever you find a { and decremented when you find a } should do the trick). Attempt to parse the resulting string as JSON. Iterate until it works or you've run out of likely blocks.
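The counter idea could look roughly like this in Python. It is a sketch under a big assumption: braces inside string literals are counted like any others, so a "}" inside a string would throw the counter off. The function names and the sample page are invented for the example.

```python
import json


def brace_balanced_candidates(text):
    """Yield substrings running from each "{" to its matching "}" by counting braces."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for index in range(start, len(text)):
            if text[index] == "{":
                depth += 1
            elif text[index] == "}":
                depth -= 1
                if depth == 0:
                    # Counter back at zero: this "}" closes the opening "{".
                    yield text[start:index + 1]
                    break
        start = text.find("{", start + 1)


def first_parsable_object(text):
    """Return the first candidate block that parses as JSON, or None."""
    for candidate in brace_balanced_candidates(text):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # Not JSON, try the next likely block.
    return None


# Hypothetical page fragment: a function body first, then the real data.
page = '<script>function f() { return 1; } var data = {"title": "Title", "nested": {"foo": "Bar"}};</script>'
parsed = first_parsable_object(page)
print(parsed)
```

It rescans from every opening brace, so it is quadratic in the worst case, which seems acceptable for a one-off batch job but not much else.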
This is ugly, hackish and should never make it to production code. I get the impression that you only need it for a batch-job, though, which is why I'm even suggesting it.