Using a regular expression to match a div block having a specific ID

I'm trying to match a div block that has a particular id. Here's my regex code:

<div[^>]*\s*id\s*=\s*["|']content["|']\s*>[^/div]+


                      
5 answers

Answer by 青春惊慌失措, 2021-01-03 07:29

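The [^/div]+ is a negated character class, not a "anything but the string /div" test: it stops as soon as it reaches any one of the characters /, d, i or v, which is not what you want. It will stop at the first i in ordinary text, for example.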
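
To make the character-class behaviour concrete, here is a minimal sketch. The sample string and the ~ delimiters are my own choices for illustration and are not from the question.

```php
<?php
// Hypothetical sample input, chosen so the text contains an "i" early on.
$html = '<div id="content">Hello, this is text</div>';

// [^/div]+ consumes characters until it meets any of: / d i v
preg_match('~>([^/div]+)~', $html, $m);

// The capture ends just before the "i" in "this", long before </div>.
echo $m[1], "\n";
```

The capture stops inside the word "this", which shows that the class excludes single characters rather than the sequence "/div".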

Unfortunately, you can't do what you want without knowing the internal structure of the HTML in the first place.  Consider this:

<div id="content">
  <div id="somethingelse">
  </div>
</div>


Even if you could construct a regexp that would match up till the </div>, you can't construct one that will match up until the correct </div>.  You need to do a much more intensive parsing.
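
As a rough illustration of the "wrong </div>" problem, the sketch below runs a simple lazy pattern (my own, not the asker's) against the nested markup shown above.

```php
<?php
$html = <<<HTML
<div id="content">
  <div id="somethingelse">
  </div>
</div>
HTML;

// A lazy .*? stops at the FIRST </div>, which closes the inner div.
preg_match('~<div id="content">.*?</div>~s', $html, $m);

// Two opening tags but only one closing tag: the match is unbalanced.
$openCount  = substr_count($m[0], '<div');
$closeCount = substr_count($m[0], '</div>');
printf("opened: %d, closed: %d\n", $openCount, $closeCount);
```

A non-recursive pattern has no way to count how many divs are still open, which is why the match ends one closing tag too early.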
                                                                        
                                                        
            
            
              
                
Answer by 盖世英雄少女心, 2021-01-03 07:41

Use a parser, not a regex.

Here's a PHP example: http://htmlparsing.com/php.html
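
A minimal sketch of the parser approach, assuming PHP's bundled DOM extension is available. The sample markup and variable names are made up for illustration; the linked page may use a different API.

```php
<?php
// Hypothetical input with a nested div inside the target.
$html = '<html><body><div id="outer"><div id="content"><p>Hi</p>'
      . '<div id="inner">nested</div></div></div></body></html>';

libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Let the parser find the element; nesting is handled by the DOM tree.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="content"]')->item(0);

// Serialize just that element, including its children.
$outerHtml = $node !== null ? $doc->saveHTML($node) : null;
echo $outerHtml, "\n";
```

Because the parser builds a tree first, the inner div's closing tag can never be mistaken for the outer one.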
                                                                        
                                                        
            
            
              
                
Answer by 借酒劲吻你, 2021-01-03 07:43

DISCLAIMER: First, I agree that, in general, regex is not the best tool for parsing HTML. However, in the right hands (and with a few caveats, see below), Philip Hazel's powerful (and most assuredly non-REGULAR) PCRE library, used by PHP's preg_*() family of functions, does allow solving non-trivial data scraping problems such as this one. The problem stated above is particularly complex to solve using regex alone, and regex solutions such as the one presented below are not for everyone and should never be attempted by a regex novice. Properly understanding the answer below requires a fairly deep comprehension of several advanced regex constructs and techniques.

Won't someone please think of the Children! Yes, I have read bobince's legendary answer and I know this is a touchy subject around here (to say the least). But please, if you are tempted to immediately click the down-vote arrow, because I am '/(?:actual|brave|stupid)ly/' using the words: REGEX and: HTML in the same breath (and on a non-trivial problem no-less), I would humbly ask you to refrain long enough to read this entire post and to actually try this solution out for yourself.

With that in mind, if you would like to see how an advanced regex can be crafted to solve this problem, (for all but a few (unlikely) special cases - see below for examples), read on...

AN ADVANCED RECURSIVE REGEX SOLUTION: As Wes Hardaker correctly points out, DIVs can be (and frequently are) nested. However, he is not 100% correct when he says "you can't construct one that will match up until the correct </div>". The truth is, with PHP, you can! (with some limitations - see below). Like Perl, the PCRE regex engine in PHP provides recursive expressions (i.e. (?R), (?1), (?2), etc.), which allow matching nested structures to arbitrary depth (limited in practice by memory and by PHP's pcre.backtrack_limit and pcre.recursion_limit settings). .NET can match nested structures too, but it does so with balancing groups rather than recursion. For example, you can easily match balanced nested parentheses with this expression: '/\((?:[^()]++|(?R))*+\)/'. Run this simple test if you have any doubts:

$text = 'zero(one(two)one(two(three)two)one)zero';
if (preg_match('/\((?:[^()]++|(?R))*+\)/', $text, $matches)) {
    print_r($matches);
}


So if we can all agree that a PHP regex can, indeed, match nested structures, let's move on to the problem at hand. This particular problem is complicated by the fact that the outermost DIV must have the id="content" attribute, but any nested DIVs may or may not. Thus, we can't use the (?R) recursively-match-the-whole-expression construct, because the subexpression to match the outer DIV is not the same as the one needed to match the inner DIVs. Instead, we need a capture group (here, group 2) that will serve as a "recursive subroutine", which matches inner, nested DIVs. So here is a tested PHP code snippet, sporting an advanced not-for-the-faint-of-heart-but-fully-commented-so-that-you-might-actually-be-able-to-make-some-sense-out-of-it regex, which correctly matches (in most cases - see below) a DIV having id="content", which may itself contain nested DIVs:

$re = '% # Match a DIV element having id="content".
    <div\b             # Start of outer DIV start tag.
    [^>]*?             # Lazily match up to id attrib.
    \bid\s*+=\s*+      # id attribute name and =
    ([\'"]?+)          # $1: Optional quote delimiter.
    \bcontent\b        # specific ID to be matched.
    (?(1)\1)           # If open quote, match same closing quote
    [^>]*+>            # remaining outer DIV start tag.
    (                  # $2: DIV contents. (may be called recursively!)
      (?:              # Non-capture group for DIV contents alternatives.
      # DIV contents option 1: All non-DIV, non-comment stuff...
        [^<]++         # One or more non-tag, non-comment characters.
      # DIV contents option 2: Start of a non-DIV tag...
      | <            # Match a "<", but only if it
        (?!          # is not the beginning of either
          /?div\b    # a DIV start or end tag,
        | !--        # or an HTML comment.
        )            # Ok, that < was not a DIV or comment.
      # DIV contents Option 3: an HTML comment.
      | <!--.*?-->     # A non-SGML compliant HTML comment.
      # DIV contents Option 4: a nested DIV element!
      | <div\b[^>]*+>  # Inner DIV element start tag.
        (?2)           # Recurse group 2 as a nested subroutine.
        </div\s*>      # Inner DIV element end tag.
      )*+              # Zero or more of these contents alternatives.
    )                  # End $2: DIV contents.
    </div\s*>          # Outer DIV end tag.
    %isx';
if (preg_match($re, $text, $matches)) {
    printf("Match found:\n%s\n", $matches[0]);
}
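
The "recursive subroutine" idea used by group 2 above may be easier to see on a much smaller pattern. The sketch below is my own reduction, not part of the original answer: a hypothetical "id:" prefix stands in for the outer start tag, and parentheses stand in for nested DIVs.

```php
<?php
// Group 1 matches one balanced parenthesised block and calls itself via (?1).
// The "id:" prefix sits outside group 1, so it is required only at the top level.
$re   = '/id:(\((?:[^()]++|(?1))*+\))/';
$text = 'skip(this) id:(one(two)one) tail';

preg_match($re, $text, $m);

// Only the block that follows "id:" is captured, nested parentheses included.
echo $m[1], "\n";
```

With (?R) the prefix would have to repeat at every nesting level; recursing into a numbered group avoids that, which is the same reason the DIV regex recurses into group 2.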


As I said, this regex is quite complex, but rest assured, it does work, with the exception of some unlikely cases noted below (and probably a few more that I would be very grateful to hear about if you find them). Try it out and see for yourself!

Should I use this? Would it be appropriate to use this regex solution in a production environment where hundreds or thousands of documents must be parsed with 100% reliability and accuracy? Of course not. Could it be useful for a limited one time run of some HTML files? (e.g. possibly the person who asked this question?) Possibly. It depends on how comfortable one is with advanced regexes. If the regex above looks like it was written in a foreign language (it is), and/or scares the dickens out of you, the answer is probably no.

It works? Yes. For example, given the following test data, the regex above correctly picks out the DIV having the id="content" (or id='content' or id=content for that matter):

<!DOCTYPE HTML SYSTEM>
<html>
<head><title>Test Page</title></head>
<body>
<div id="non-content-div">
    <h1>PCRE does recursion!</h1>
    <div id='content'>
        <h2>First level matched</h2>
        <!-- this comment </div> is tricky -->
        <div id="one-deep">
            <h3>Second level matched</h3>
            <div id=two-deep>
                <h4>Third level matched</h4>
                <div id=three-deep>
                    <h4>Fourth level matched</h4>
                </div>
                <p>stuff</p>
            </div>
            <!-- this comment <div> is tricky -->
            <p>stuff</p>
        </div>
        <p>stuff</p>
    </div>
    <p>stuff</p>
</div>
<p>stuff</p>
</body></html>


CAVEATS: So what are some scenarios where this solution does not work? Well, DIV start tags may NOT have any angle brackets in any of their attributes (it is possible to remove this limitation, but this adds quite a bit more to the code). And the following CDATA spans, which contain the specific DIV start tag we are looking for (highly unlikely), will cause the regex to fail:

<style type="text/css">
p:before {
    content: 'Unlikely CSS string with <div id=content> in it.';
}
</style>
<p title="Unlikely attribute with a <div id=content> in it">stuff</p>
<script type="text/javascript">
    alert("evil script with <div id=content> in it");
</script>
<!-- Comment with <div id="content"> in it -->
<![CDATA[ a CDATA section with <div id="content"> in it ]]>


I would very much like to know of any others.
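
The comment case from the list above can be reproduced with a condensed copy of the regex (same tokens, comments stripped). The input string is a made-up example.

```php
<?php
// Condensed copy of the commented regex above; whitespace is ignored under /x.
$re = '%<div\b[^>]*?\bid\s*+=\s*+([\'"]?+)\bcontent\b(?(1)\1)[^>]*+>
    ( (?: [^<]++ | <(?!/?div\b|!--) | <!--.*?--> | <div\b[^>]*+>(?2)</div\s*> )*+ )
    </div\s*>%isx';

// Hypothetical input: the target start tag also appears inside a comment.
$text = '<!-- old: <div id="content">stale</div> -->'
      . '<div id="content">live</div>';

preg_match($re, $text, $matches);

// Comments are only recognised once the match is already inside the target
// DIV, so the leftmost match is the commented-out block.
echo $matches[0], "\n";
```

Pre-stripping comments, scripts and styles before applying the regex is one possible workaround, at the cost of yet more special-case handling.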

GO READ MRE3 As I said before, to truly grasp what is going on here requires a pretty deep understanding of several advanced techniques. These techniques are not obvious or intuitive. There is only one way that I know of to gain these skills and that is to sit down and study: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3). (You will be glad you did!)

I can honestly say that this is the most useful book I have read in my entire life!

Cheers!

EDIT 2013-04-30 Fixed Regex. It previously disallowed a non-DIV tag which immediately followed the DIV start tag.
                                                                        
                                                        
            
            
              
                
Answer by 遇见更好的自我, 2021-01-03 07:43

This article is amazing and is a perfect solution for my needs!

It even works on HTML where SimpleXML or DOMDocument fails!

Sometimes you have to parse HTML generated by a third party that you have no control over and that doesn't respect any DTD; that is where recursive regular expressions come in.

I made a few modifications to your code and used it with PHP's preg_match_all function.

In the following example we will try to properly match div#content:

$content = <<<HTML
<div id="content">
    <!-- tutu -->
    <div id="something">
        <div id="somethingElse">
            <ul>
                <li>lorem 1</li>
                <li class="dfg" toto="titi">lorem 2</li>
                <li class="dfg">lorem 3</li>
                <li class="dfg">lorem 4</li>
                <li class="dfg">lorem 5</li>
                <li class="dfg">lorem 6</li>
            </ul>
            <br />
            <div id="emptyStuff"></div>
        </div>
    </div>
    <table>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
    </table>
</div>
HTML;

$pattern = '@# match nested tag
(?(DEFINE)
    (?<comment>     <!--.*?-->)
    (?<cdata>       <!\[CDATA\[.*?\]\]>)
    (?<empty>       <\w+[^>]*?/>)
    (?<inline>      <(script|style)[^>]+>.*?</\g{-1}>)
    (?<nested>      <(\w+)[^>]*(?<!/)>(?&innerHTML)</\g{-1}>)
    (?<unclosed>        <\w+[^>]*(?<!/)>)
    (?<text>        [^<]+)
)
(?<outerHTML><(?<tagName>div)\s?(?<attributes>[^>]*?id\h*=\h*(?<quote>"|\')(?:(?!\k<quote>)[^\v>])*\bcontent\b(?:(?!\k<quote>)[^\v>])*\k<quote>[^>]*)> # opening tag
(?<innerHTML>
    (?: (?&comment) | (?&cdata) | (?&empty) | (?&inline) | (?&nested) | (?&unclosed) | (?&text) )*
)
</(?&tagName)>) # closing tag
@six';

preg_match_all($pattern, $content, $matches);

var_dump(array_intersect_key($matches, array(
    'tagName' => 1,
    'attributes' => 1,
    'innerHTML' => 1,
    'outerHTML' => 1
)));


Here is the output:

array(4) {
  ["outerHTML"]=>
  array(1) {
    [0]=>
    string(639) "<div id="content">
    <!-- tutu -->
    <div id="something">
        <div id="somethingElse">
            <ul>
                <li>lorem 1</li>
                <li class="dfg" toto="titi">lorem 2</li>
                <li class="dfg">lorem 3</li>
                <li class="dfg">lorem 4</li>
                <li class="dfg">lorem 5</li>
                <li class="dfg">lorem 6</li>
            </ul>
            <br />
            <div id="emptyStuff"></div>
        </div>
    </div>
    <table>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
    </table>
</div>"
  }
  ["tagName"]=>
  array(1) {
    [0]=>
    string(3) "div"
  }
  ["attributes"]=>
  array(1) {
    [0]=>
    string(12) "id="content""
  }
  ["innerHTML"]=>
  array(1) {
    [0]=>
    string(615) "
    <!-- tutu -->
    <div id="something">
        <div id="somethingElse">
            <ul>
                <li>lorem 1</li>
                <li class="dfg" toto="titi">lorem 2</li>
                <li class="dfg">lorem 3</li>
                <li class="dfg">lorem 4</li>
                <li class="dfg">lorem 5</li>
                <li class="dfg">lorem 6</li>
            </ul>
            <br />
            <div id="emptyStuff"></div>
        </div>
    </div>
    <table>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
        <tr>
            <td>cell 1</td>
            <td>cell 2</td>
            <td>cell 3</td>
            <td>cell 4</td>
            <td>cell 5</td>
            <td>cell 6</td>
        </tr>
    </table>
"
  }
}
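
For readers new to the (?(DEFINE)...) block used in the pattern above, here is a much smaller sketch of the same technique. The two named groups and the <p> wrapper are my own simplification: groups inside DEFINE never match on their own and are only invoked through (?&name) calls.

```php
<?php
$re = '~(?(DEFINE)
    (?<text>    [^<]+ )
    (?<comment> <!--.*?--> )
)
<p> (?: (?&text) | (?&comment) )* </p>~sx';

// Hypothetical input: a closing tag hidden inside a comment.
$html = '<p>before<!-- a </p> inside a comment -->after</p>';

preg_match($re, $html, $m);

// The comment is consumed as a unit, so the </p> inside it is skipped.
echo $m[0], "\n";
```

Naming each building block this way keeps the main expression short, which is what makes the larger pattern above readable at all.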


I hope it helps!
                                                                        
                                                        
            
            
              
                
Answer by 隐瞒了意图╮, 2021-01-03 07:48

<div id=content>.*?</div>


is what you need - as long as you don't have nested divs. If you do have them, give up and use an actual XML parser.

Switch on the "dotall" option though (check http://www.regular-expressions.info/dot.html and find out how to do that with your regex flavour).

Minor details up to you. :-)
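
In PHP (the flavour used elsewhere in this thread) the "dotall" option is the s pattern modifier. A small sketch with a made-up multi-line input:

```php
<?php
$html = "<div id=content>\nline one\nline two\n</div>";

// Without /s the dot does not match newlines, so a multi-line div is missed.
$withoutDotall = preg_match('~<div id=content>.*?</div>~', $html);

// With /s the dot matches newlines too and the whole block is found.
$withDotall = preg_match('~<div id=content>.*?</div>~s', $html, $m);

printf("without s: %d, with s: %d\n", $withoutDotall, $withDotall);
```

As the answer says, this only holds while the target div contains no nested divs.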
                                                                        
                                                        
            
            
              
                