Regex for nested XML attributes

Submitted by 扶醉桌前 on 2019-12-02 08:38:05

In generic regex there's no good way to handle nesting. Hence all the whining when a question like this comes up - never use regex to parse XML/HTML.

In some simple cases it might be advantageous though. If, like in your example, there's a limited number of levels of nesting, you can quite simply add one regex for each level.

Now let's do this in steps. To handle the first un-nested attribute you can use

{[^}]*}

This matches an opening brace, followed by any number of characters that are not a closing brace, followed by a closing brace. For simplicity I'm gonna put the heart of it in a non-capturing group, like

{(?:[^}])*}

The group is needed once we start inserting alternatives next to [^}].
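As a quick check of the flat pattern, here is a minimal Python sketch. The sample string is made up for illustration (it is not from the question), and the braces are escaped only for readability.

```python
import re

# Flat pattern from the text: opening brace, any run of
# non-closing-brace characters, closing brace.
FLAT = re.compile(r"\{(?:[^}])*\}")

# Hypothetical input, not taken from the question.
sample = "a={one} b={two {inner} rest}"

matches = FLAT.findall(sample)
print(matches)  # ['{one}', '{two {inner}']
```

The second match stops at the first closing brace it meets, so the nested value is cut short. That is the problem the next steps address.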

If you now allow that "anything but a closing brace" part ([^}]) to also be another nested level of braces, by simply adding the first regex as an alternative, like

{(?:{[^}]*}|[^}])*}
    ^^^^^^^    original regex inserted as alternative (to itself)

it allows for one level of nesting. Doing the same again, joining this regex as an alternative to itself, like

{(?:{(?:{[^}]*}|[^}])*}|{[^}]*}|[^}])*}
        ^^^^^^^^^^^^^^^    previous level repeated

will allow for another level of nesting. This can be repeated for more levels if wanted.
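The repetition can be automated. Below is a rough Python sketch under a few assumptions: the helper name nested_brace_pattern and the sample string are invented for illustration. Each pass wraps the previous pattern as an alternative to [^}], so it produces the slightly shorter form without the extra {[^}]*} alternative shown above, which appears to be redundant.

```python
import re

def nested_brace_pattern(levels: int) -> str:
    """Build a pattern for braces nested up to `levels` deep (0 = flat)."""
    pattern = r"\{[^}]*\}"
    for _ in range(levels):
        # Insert the previous level as an alternative to "any non-} character".
        pattern = r"\{(?:" + pattern + r"|[^}])*\}"
    return pattern

# Hypothetical input with two levels of nesting inside the outer braces.
sample = "x={a {b {c} d} e}"

one_level = re.search(nested_brace_pattern(1), sample).group(0)
two_levels = re.search(nested_brace_pattern(2), sample).group(0)

print(one_level)   # {a {b {c} d}
print(two_levels)  # {a {b {c} d} e}
```

With too few levels the match silently ends early instead of failing, so the level count has to be at least the deepest nesting you expect.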

This doesn't handle the capture of attribute names and stuff though, because your question isn't quite clear on what you want there, but it shows you one way (i.m.o. the easiest to understand, or... :P) to handle nesting in regex.

You can see it handle your example here at regex101.

Regards

You're trying to deal with a balanced set of braces. This requires recursive regular expressions. Strictly speaking, recursive regexes are no longer regular in the formal sense. Anyway, some languages support them, e.g. Perl, PHP, Ruby. This is a good tutorial on the topic.

Generally, you should extract this kind of information with a full-fledged parser, for example one generated with yacc.
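If a generated parser is more than you need, a hand-written scanner that counts brace depth handles arbitrary nesting. The Python sketch below is a stand-in for a real parser, not a yacc grammar. The function name balanced_spans and the sample string are assumptions made for illustration, and it ignores quoting and escaped braces.

```python
def balanced_spans(text: str) -> list[str]:
    """Return each top-level {...} span, whatever its nesting depth."""
    spans = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i  # remember where the outermost brace opened
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans

# Hypothetical input, not taken from the question.
sample = "x={a {b {c} d} e} y={st}"
print(balanced_spans(sample))  # ['{a {b {c} d} e}', '{st}']
```

Unmatched closing braces are skipped, and an opening brace that is never closed yields no span.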

This is a regex that can deal with the non-balanced braces: ([ =]*)=(\{[^}]*\}). This will match {<dd>sop</dd>} and {st} which is correct. Unfortunately, it will match { <bb y={ <cc x={st} too, which is not quite what you want.
