How to fix a BBcode regular expression

落爺英雄遲暮 submitted on 2019-12-10 20:57:42

Question


I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.

Here is the current expression:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

Here is some text it successfully matches against and the groups it builds:

[url=http://www.google.com]Go to google![/url]
1: url
2: http://www.google.com
3: Go to google!

[img]http://www.somesite.com/someimage.jpg[/img]
1: img
2: NULL
3: http://www.somesite.com/someimage.jpg

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
1: quote
2: NULL
3: [quote]first nested quote[/quote][quote]second nested quote[/quote]

All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handle all tags that are nested. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two quote tags, so we would expect two matches. However, we get one match, like this:

[quote]first nested quote[/quote][quote]second nested quote[/quote]
1: quote
2: NULL
3: first nested quote[/quote][quote]second nested quote
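
For anyone who wants to reproduce this, here is a minimal sketch. The question does not say which regex engine is in use, so Python's `re` module is an assumption, and the helper name `groups` is made up for illustration. Python reports the empty 2nd group as `''` rather than NULL.

```python
import re

# The original expression from the question; the 3rd group is a greedy (.+).
GREEDY = re.compile(r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]")


def groups(pattern, text):
    """Return the (tag, value, content) groups of every match in text."""
    return [m.groups() for m in pattern.finditer(text)]


samples = [
    "[url=http://www.google.com]Go to google![/url]",
    "[img]http://www.somesite.com/someimage.jpg[/img]",
    # Two sibling quotes: the greedy (.+) runs to the LAST [/quote],
    # so the pair is reported as a single match.
    "[quote]first nested quote[/quote][quote]second nested quote[/quote]",
]

for sample in samples:
    print(sample)
    for tag, value, content in groups(GREEDY, sample):
        print("  1:", tag, "| 2:", repr(value), "| 3:", content)
```

The last sample yields one match whose 3rd group still contains `[/quote][quote]`, which is the glitch described above.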

Ahhhh! A single match is not what we wanted at all. There is a fairly simple fix. I change the regex from this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

To this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]

By replacing (.+) with (((?!\[/\1\]).)+), the 3rd match group can no longer run past a closing BBcode tag: each character is consumed only if the closing tag does not start at that position. So now this works, and we get two matches:

[quote]first nested quote[/quote][quote]second nested quote[/quote]

[quote]first nested quote[/quote]
1: quote
2: NULL
3: first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote

I was happy that fixed it, but now we have another problem. This new regex fails on the first one where we nest the two quote tags under one larger quote tag. We get two matches instead of one:

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]

[quote][quote]first nested quote[/quote]
1: quote
2: NULL
3: [quote]first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote

The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
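
Both behaviours of the modified expression can be reproduced with a short sketch. Python's `re` module is an assumption here (the question does not name the engine), and `first_three_groups` is a hypothetical helper. The extra inner parentheses create a 4th group that holds only the last consumed character, so the helper reads groups 1 to 3 explicitly.

```python
import re

# The modified expression: every character of the 3rd group must NOT be
# the start of the closing tag, so the group stops at the first [/tag].
TEMPERED = re.compile(
    r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"
)


def first_three_groups(text):
    """Return (tag, value, content) for every match, ignoring group 4."""
    return [m.group(1, 2, 3) for m in TEMPERED.finditer(text)]


siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"
nested = "[quote]" + siblings + "[/quote]"

# Siblings are now split correctly into two matches.
print(first_three_groups(siblings))

# Nesting breaks: the outer [quote] is paired with the first [/quote] it
# meets, so the 3rd group of the first match is "[quote]first nested quote".
print(first_three_groups(nested))
```

A single regular-language pattern has to pick one of the two behaviours: stopping at the first closing tag fixes siblings and breaks nesting, while running to the last closing tag does the opposite. Getting both right requires counting open tags.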

Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.


Answer 1:


Using balancing groups (a .NET regex feature) you can construct a regex like this. The layout whitespace requires RegexOptions.IgnorePatternWhitespace:

(?>
  \[ (?<tag>[^][/=\s]+) \s*
  (?: = \s* (?<val>[^][]*) \s*)?
  ]
)

(?<content>
  (?>
    \[(?<innertag>[^][/=\s]+)[^][]*]
    |
    \[/(?<-innertag>\k<innertag>)]
    |
    [^][]+
  )*
  (?(innertag)(?!))
)

\[/\k<tag>]

Simplified according to Kobi's example.


In the following:

[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

It finds these matches:

  • [foo=bar]baz[/foo]
  • [b]foo[/b]
  • [i][i][foo=bar]baz[/foo]foo[/i][/i]
  • [i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
  • [quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

Full example at http://ideone.com/uULOs

(Old version http://ideone.com/AXzxW)
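
Balancing groups are specific to .NET, so the regex above will not run in engines such as Python's `re`. As a rough equivalent for other environments, here is a sketch that swaps in a different technique: a small tokenizing regex plus an explicit stack, which plays the role of the `innertag` stack. It is not the regex from this answer, the names `TOKEN` and `top_level_elements` are made up for illustration, and it assumes the input is well formed.

```python
import re

# Matches one opening tag ([tag] or [tag=value]) or one closing tag ([/tag]).
TOKEN = re.compile(
    r"\[(?:/(?P<close>[^\]\[/=\s]+)"
    r"|(?P<open>[^\]\[/=\s]+)\s*(?:=\s*(?P<val>[^\]\[]*))?)\]"
)


def top_level_elements(text):
    """Return (tag, value, content) for each balanced top-level element."""
    stack = []    # currently open tags: (name, value, index where content starts)
    results = []
    for m in TOKEN.finditer(text):
        if m.group("open") is not None:
            # Opening tag: push it, like capturing into a balancing group.
            stack.append((m.group("open"), m.group("val"), m.end()))
        elif stack and stack[-1][0] == m.group("close"):
            # Matching closing tag: pop, and report only when depth returns to 0.
            tag, val, content_start = stack.pop()
            if not stack:
                results.append((tag, val, text[content_start:m.start()]))
        else:
            # Mismatched closing tag: this sketch simply drops the pending tags.
            stack.clear()
    return results


sample = "[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]"
for tag, val, content in top_level_elements(sample):
    print("tag:", tag, "| value:", val, "| content:", content)
```

Running `top_level_elements` again on each returned content string gives the recursive handling of nested tags that the question describes.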



Source: https://stackoverflow.com/questions/7018321/how-to-fix-a-bbcode-regular-expression
