Regular expression for nested tags (innermost to make it easier)


I researched this quite a bit, but couldn't find a working example of how to match nested HTML tags with attributes. I know it is possible to match balanced/neste


4 answers

没有蜡笔的小新

2021-01-23 02:26

To match the innermost pairs of <div> and </div> tags, including their attributes and content:

#<div(?:(?!(<div|</div>)).)*</div>#s

The key here is that (?:(?!STRING).)* is to strings as [^CHAR]* is to characters: it consumes text up to, but not including, the next occurrence of STRING.

Credit: https://stackoverflow.com/a/6996274
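The analogy can be checked with a small sketch. This is an illustration in JavaScript rather than part of the original answer; the sample strings and variable names are made up for the example.

```javascript
// [^;]* consumes characters until the first ';' character.
const untilChar = 'a=1;b=2'.match(/^[^;]*/)[0];

// (?:(?!END).)* consumes characters until the first occurrence of the string "END".
// The negative lookahead is tested before every single character that "." consumes.
const untilString = 'a=1ENDb=2'.match(/^(?:(?!END).)*/)[0];

console.log(untilChar);   // a=1
console.log(untilString); // a=1
```

In both cases the match stops just before the delimiter; the only difference is whether the delimiter is one character or a whole string.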

Example in PHP:

<?php

$text = <<<'EOD'
<div id="1">
  in 1
  <div id="2">
    in 2
    <div id="3">
      in 3
    </div>
  </div>
</div>
<div id="4">
  in 4
  <div id="5">
    in 5
  </div>
</div>
EOD;

$matches = array();
preg_match_all('#<div(?:(?!(<div|</div>)).)*</div>#s', $text, $matches);

foreach ($matches[0] as $index => $match) {
  echo "************" . "\n" . $match . "\n";
}

Outputs:

************
<div id="3">
      in 3
    </div>
************
<div id="5">
    in 5
  </div>
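
For comparison, the same pattern should carry over to JavaScript more or less unchanged. The following is a rough port, not part of the original answer: the function name innermostDivs and the sample string are assumptions, and the s flag plays the role of PHP's s modifier so that . also matches newlines.

```javascript
// Same pattern as the PHP example: "<div", then any characters that do not
// begin another "<div" or a "</div>", then the closing "</div>".
const INNERMOST_DIV = /<div(?:(?!<div|<\/div>).)*<\/div>/gs;

// Hypothetical helper: returns every innermost <div>...</div> element as a string.
function innermostDivs(html) {
  return html.match(INNERMOST_DIV) ?? [];
}

const sample =
  '<div id="1">a<div id="2">b<div id="3">c</div></div></div>' +
  '<div id="4">d<div id="5">e</div></div>';

const found = innermostDivs(sample);
console.log(found); // [ '<div id="3">c</div>', '<div id="5">e</div>' ]
```

Only divs 3 and 5 are returned, because every other div contains another <div before its own closing tag.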


不思量自难忘°

2021-01-23 02:30

I wrote a short Python script to handle nested tags. It works well with HTML and XML, and also with other, messier nested syntaxes such as wiki code. Ironically, I wrote it to avoid regex, which I couldn't understand at all. :-( I use that function for everything. It's fast too, since it only uses basic string search. I'm very happy to learn that regex can't help here. :-)

I'd be glad to share the script if any of you is interested, but bear in mind that I'm not a programmer, and I presume this problem was solved long ago!

You can find me on my talk page at it.wikisource: http://it.wikisource.org/wiki/Discussioni_utente:Alex_brollo
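
The script itself is not shown in the answer, so the following is only a guess at what a "basic string search" approach might look like, written in JavaScript rather than Python. The function name findBalanced and its parameters are hypothetical. It tracks nesting depth with indexOf instead of a regex, and it simplifies by treating any text starting with "<" + tag as an opening tag (so "<div" would also match "<divider").

```javascript
// Hypothetical helper: returns the first complete <tag>...</tag> element found
// at or after `from`, including any nested elements of the same tag,
// or null if there is no opening tag or the tags are unbalanced.
function findBalanced(html, tag, from = 0) {
  const open = '<' + tag;
  const close = '</' + tag + '>';
  const start = html.indexOf(open, from);
  if (start === -1) return null;

  let depth = 0;
  let pos = start;
  while (true) {
    const nextOpen = html.indexOf(open, pos);
    const nextClose = html.indexOf(close, pos);
    if (nextClose === -1) return null; // ran out of closing tags
    if (nextOpen !== -1 && nextOpen < nextClose) {
      depth++; // an opening tag comes first: go one level deeper
      pos = nextOpen + open.length;
    } else {
      depth--; // a closing tag comes first: go one level up
      pos = nextClose + close.length;
      if (depth === 0) return html.slice(start, pos);
    }
  }
}

const outer = findBalanced('<div a><div b>x</div>y</div><div>z</div>', 'div');
console.log(outer); // <div a><div b>x</div>y</div>
```

Unlike the innermost-match regex, this returns the outermost element together with everything nested inside it.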

别跟我提以往

2021-01-23 02:45

See the Stack Overflow question "RegEx match open tags except XHTML self-contained tags".

And indeed, it is absolutely impossible. HTML has something unique, something magical, which is immune to RegEx.


独厮守ぢ

2021-01-23 02:47

You can handle the nesting by applying the same regex repeatedly until the text stops changing. Like this:

function htmlToPlainText(html) {
  let text = html || '';

  // Some attributes contain nested HTML, so a single pass is not enough:
  // keep stripping innermost tags until a pass changes nothing.
  while (text !== (text = text.replace(/<[^<>]*>/g, '')));

  return text;
}

This works with cases like:

<p data-attr="<span>Oh!</span>">Lorem Ipsum</p>

I found this script here: http://blog.stevenlevithan.com/archives/reverse-recursive-pattern
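
To see why more than one pass is needed, here is a small variant that records the text after each pass. It is an illustrative sketch, not part of the original answer; the name stripTagsWithTrace is made up.

```javascript
// Hypothetical variant of htmlToPlainText that keeps every intermediate result.
function stripTagsWithTrace(html) {
  const steps = [html];
  let current = html;
  while (true) {
    // Remove only tags that contain no '<' or '>' inside them (the innermost ones).
    const next = current.replace(/<[^<>]*>/g, '');
    if (next === current) break; // nothing left to strip
    steps.push(next);
    current = next;
  }
  return steps;
}

const steps = stripTagsWithTrace('<p data-attr="<span>Oh!</span>">Lorem Ipsum</p>');
console.log(steps);
// [
//   '<p data-attr="<span>Oh!</span>">Lorem Ipsum</p>',
//   '<p data-attr="Oh!">Lorem Ipsum',
//   'Lorem Ipsum'
// ]
```

The first pass cannot remove the opening <p ...> tag because it still contains angle brackets inside its attribute; once the inner <span> tags are gone, the second pass removes it.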
