Regular expression for nested tags (innermost to make it easier)

时光总嘲笑我的痴心妄想 submitted on 2019-12-20 03:05:59

Question


I researched this quite a bit, but couldn't find a working example of how to match nested HTML tags with attributes. I know it is possible to match balanced/nested innermost tags without attributes (for example, a regex for <div> and </div> would be #<div\b[^>]*>(?:(?> [^<]+ ) |<(?!div\b[^>]*>))*?</div>#x).

However, I would like to see a regex pattern that finds an html tag pair with attributes.

Example: It basically should match

<div class="aaa"> **<div class="aaa">** <div> <div> </div> **</div>** </div>

and not

<div class="aaa"> **<div class="aaa">** <div> <div> **</div>** </div> </div>

Does anybody have any ideas?

For testing purposes we could use: http://www.lumadis.be/regex/test_regex.php


PS: Steven mentioned a solution on his blog (actually in a comment), but it doesn't work:

http://blog.stevenlevithan.com/archives/match-innermost-html-element

$regex = '/<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>(?:((?:[^<]++|<(?!\/?div\b[^>]*>))+)|(<div\b[^>]*>(?>(?1)|(?2))*<\/div>))?<\/div>/i';

Answer 1:


See: RegEx match open tags except XHTML self-contained tags

And indeed, it is absolutely impossible. HTML has something unique, something magical, which is immune to RegEx.




Answer 2:


I built a short Python script to handle nested tags. It works well with HTML and with other, messier nested syntaxes too, such as wiki code. Ironically, I wrote it to avoid regex! I couldn't understand them at all. :-( I use that function for everything; it works very well for HTML and XML. It's fast too, since it only uses basic string search. I'm very happy to know that regex can't help. :-)

I'd be happy to share the script if anyone is interested. Bear in mind that I'm not a programmer, and I presume the problem was solved long ago!

You can find me on my talk page at it.wikisource: http://it.wikisource.org/wiki/Discussioni_utente:Alex_brollo
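The script itself was never posted, so the following is only a minimal sketch of the general idea (plain string search plus a depth counter), not the answerer's actual code. The function names next_open and find_balanced are made up for illustration. It assumes lowercase tag names, no self-closing <div/> tags, and no tag-like text inside comments or attribute values.

```python
def next_open(text, tag, pos):
    """Index of the next '<tag' followed by whitespace, '>' or '/', else -1."""
    needle = "<" + tag
    while True:
        i = text.find(needle, pos)
        if i == -1:
            return -1
        after = text[i + len(needle): i + len(needle) + 1]
        if after == ">" or after == "/" or after.isspace():
            return i
        pos = i + 1  # e.g. '<divider' is not a '<div' tag; keep looking


def find_balanced(text, tag, start=0):
    """Return (begin, end) of the first balanced <tag ...>...</tag> element
    that opens at or after `start`, or None if there is no balanced match."""
    close = "</" + tag + ">"
    begin = next_open(text, tag, start)
    if begin == -1:
        return None
    depth = 1          # we are inside one open element
    pos = begin + 1
    while depth > 0:
        nxt_open = next_open(text, tag, pos)
        nxt_close = text.find(close, pos)
        if nxt_close == -1:
            return None            # ran out of closing tags: unbalanced
        if nxt_open != -1 and nxt_open < nxt_close:
            depth += 1             # a nested element opens first
            pos = nxt_open + 1
        else:
            depth -= 1             # a closing tag comes first
            pos = nxt_close + len(close)
    return begin, pos


html = '<div class="a"> <div class="b"> <div> x </div> </div> </div> <div>y</div>'
span = find_balanced(html, "div", start=1)   # skip the outermost opening tag
matched = html[span[0]:span[1]]
print(matched)
```

Because the opening tag is located by its name alone, attributes need no special handling here; the whole opening tag is simply part of the returned slice.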




Answer 3:


To match innermost pairs of <div> and </div> tags, including their attributes and content:

#<div(?:(?!(<div|</div>)).)*</div>#s

The key here is that (?:(?!STRING).)* is to strings as [^CHAR]* is to characters.
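The analogy can be tried out in a few lines. The snippet below is a small illustration in Python rather than PHP, on the assumption that the standard re module treats lookaheads the same way for this pattern; the sample strings and variable names are made up. re.DOTALL plays the role of the s modifier.

```python
import re

# [^CHAR]* : consume characters until the single character ';' appears.
field = re.match(r"[^;]*", "alpha beta; gamma").group(0)

# (?:(?!STRING).)* : consume characters until the string 'END' appears.
# A lone 'E' does not stop the match; only the full string 'END' does.
chunk = re.match(r"(?:(?!END).)*", "alpha E beta END gamma").group(0)

# The same idea applied to tags: between '<div' and '</div>', refuse to step
# over another '<div' or '</div>', so only innermost pairs can match.
INNERMOST_DIV = re.compile(r"<div(?:(?!<div|</div>).)*</div>", re.DOTALL)
html = '<div id="1"> a <div id="2"> b </div> </div>'
innermost = INNERMOST_DIV.findall(html)

print(repr(field))    # 'alpha beta'
print(repr(chunk))    # 'alpha E beta '
print(innermost)      # ['<div id="2"> b </div>']
```

The lookahead is checked before every single character, so the pattern does more work than a plain negated character class; for small inputs that cost is usually negligible.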

Credit: https://stackoverflow.com/a/6996274


Example in PHP:

<?php

$text = <<<'EOD'
<div id="1">
  in 1
  <div id="2">
    in 2
    <div id="3">
      in 3
    </div>
  </div>
</div>
<div id="4">
  in 4
  <div id="5">
    in 5
  </div>
</div>
EOD;

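$matches = array();
// The s modifier lets . match newlines, so a match can span several lines.
preg_match_all('#<div(?:(?!(<div|</div>)).)*</div>#s', $text, $matches);

// $matches[0] holds the full text of each innermost <div>...</div> pair.
foreach ($matches[0] as $match) {
  echo "************" . "\n" . $match . "\n";
}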

Outputs:

************
<div id="3">
      in 3
    </div>
************
<div id="5">
    in 5
  </div>


Source: https://stackoverflow.com/questions/3076219/regular-expression-for-nested-tags-innermost-to-make-it-easier
