Regex for nested XML attributes

Let's say I have the following string:

"<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>"

How can I write a general-purpose regex (tag names and attribute names vary) that matches the content inside {}, i.e. either <dd>sop</dd> or <bb y={ <cc x={st}>ABC</cc> }></bb>?

The regex I wrote, "(\s*\w*=\s*\{)\s*(<.*>)\s*(\})", matches

"<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb>", which is not correct.

With plain regular expressions there is no good way to handle arbitrary nesting. Hence all the whining whenever a question like this comes up: never use regex to parse XML/HTML.

In some simple cases regex can still be useful, though. If, as in your example, the nesting depth is limited, you can simply extend the regex by one step for each level of nesting.

Now let's do this in steps. To handle the first un-nested attribute you can use

{[^}]*}

This matches an opening brace, followed by any number of characters that are not a closing brace, followed by a closing brace. For simplicity I'm going to put the heart of it in a non-capturing group, like

{(?:[^}])*}

The group is needed as soon as alternatives are inserted next to [^}].

If you now allow that "anything but a closing brace" part ([^}]) to alternatively be another nested pair of braces, by inserting the first regex as an alternative, like

{(?:{[^}]*}|[^}])*}
    ^^^^^^^    original regex inserted as an alternative (to itself)

it allows for one level of nesting. Doing the same again, joining this regex as an alternative to itself, like

{(?:{(?:{[^}]*}|[^}])*}|{[^}]*}|[^}])*}
    ^^^^^^^^^^^^^^^^^^^    previous level repeated

will allow for another level of nesting. This can be repeated for more levels if wanted.
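The construction above can also be generated rather than typed by hand. The following is a minimal Python sketch, assuming Python's `re` module; the function name `nested_brace_pattern` and the constant `SAMPLE` are made up for this illustration. It wraps the previous level as an alternative to `[^}]` once per level, so for two levels it produces the same idea as the regex above, minus the redundant `{[^}]*}` alternative.

```python
import re

def nested_brace_pattern(levels: int) -> str:
    """Return a regex for {...} that tolerates `levels` levels of inner braces."""
    pattern = r"\{[^}]*\}"  # level 0: no nested braces allowed
    for _ in range(levels):
        # Offer the previous level as an alternative to "anything but }".
        pattern = r"\{(?:" + pattern + r"|[^}])*\}"
    return pattern

# The string from the question.
SAMPLE = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'

# Two levels of nesting are enough for this input.
matches = re.findall(nested_brace_pattern(2), SAMPLE)
for m in matches:
    print(m[1:-1].strip())  # drop the outer braces and surrounding spaces
# <dd>sop</dd>
# <bb y={ <cc x={st}>ABC</cc> }></bb>
```

With too few levels the second match is cut short at an inner closing brace, so the level count has to cover the deepest nesting you expect.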

This doesn't handle capturing attribute names and the like, because your question isn't quite clear on what you want there, but it shows one way (IMO the easiest to understand) to handle nesting in regex.

You can see it handle your example here at regex101.

Regards

You're trying to match a balanced set of braces. In general this requires recursive patterns, which strictly speaking go beyond regular expressions, because balanced nesting is not a regular language. Still, some languages support them, e.g. Perl, PHP, and Ruby. This is a good tutorial on the topic.

Generally, you should extract this kind of information with a full-fledged parser, for example one built with a parser generator like yacc.
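For this particular input, even something much smaller than a yacc grammar may do. The sketch below is not a real parser, only a hand-written depth counter in Python that is swapped in here for illustration; the name `top_level_brace_contents` is hypothetical. It assumes braces only ever appear as attribute delimiters (no escaped or quoted braces).

```python
from typing import List

def top_level_brace_contents(text: str) -> List[str]:
    """Collect the text inside each outermost {...} pair, at any nesting depth."""
    contents = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1  # content begins right after the opening brace
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError(f"unmatched closing brace at index {i}")
            depth -= 1
            if depth == 0:
                contents.append(text[start:i].strip())
    if depth != 0:
        raise ValueError("unclosed opening brace")
    return contents

SAMPLE = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'
print(top_level_brace_contents(SAMPLE))
# ['<dd>sop</dd>', '<bb y={ <cc x={st}>ABC</cc> }></bb>']
```

Unlike a fixed-depth regex, the counter has no limit on nesting depth, and it reports unbalanced input instead of silently returning a truncated match.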

This is a regex that can deal with non-nested braces: ([^ =]*)=(\{[^}]*\}). On flat attributes such as {<dd>sop</dd>} and {st} it matches correctly. Unfortunately, it will also match { <bb y={ <cc x={st}, which is not what you want.
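To see that behaviour on the string from the question, here is a short Python sketch. It assumes the first group is meant to capture the attribute name and therefore writes it as `[^ =]*`; the names `FLAT` and `SAMPLE` are made up for the example.

```python
import re

# Assumption: group 1 is the attribute name, group 2 the braced value.
FLAT = re.compile(r"([^ =]*)=(\{[^}]*\})")

SAMPLE = '<aa v={<dd>sop</dd>} z={ <bb y={ <cc x={st}>ABC</cc> }></bb> }></aa>'

pairs = FLAT.findall(SAMPLE)
for name, value in pairs:
    print(name, "->", value)
# v -> {<dd>sop</dd>}
# z -> { <bb y={ <cc x={st}
```

The second match stops at the first closing brace it meets, and in a left-to-right scan the inner {st} is swallowed by that match rather than reported separately.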

Source: https://stackoverflow.com/questions/37113364/regex-for-nested-xml-attributes

Tags

regex

recursive-regex