Regexp for extracting all links and anchor texts from HTML

后端未结

关注

 6  913

I\'d like one or more regexes that can:

1) Take the html of a large page.

2) Find the urls contained in all links, for example:


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2020-12-15 14:47
              
            
            
                                                                       
<?

$dom = new DomDocument();
$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  甜味超标        
                
              
                            
                2020-12-15 14:56
              
            
            
                                                                       
/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南旧        
                
              
                            
                2020-12-15 14:59
              
            
            
                                                                       
You need to take a look at look ahead and look behind.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches))
        {
        /*** if we find the word white, not followed by house ***/
        echo 'Found a match';
        print_r($matches);
    }
else
        {
        /*** if no match is found ***/
        echo 'No match found';
        }
?>

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别跟我提以往        
                
              
                            
                2020-12-15 15:02
              
            
            
                                                                       
As far as using RegEx to extract links from HTML goes, this one is pretty damn robust:

\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))

Here's one that extracts all 'plain' text (i.e. content outside tags) from HTML documents:

(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)

Test them both here: http://www.martinwardener.com/regex
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  遇见更好的自我        
                
              
                            
                2020-12-15 15:03
              
            
            
                                                                       
Try something like this:

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2020-12-15 15:04
              
            
            
                                                                       
<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER))
{ foreach($matches as $match)
{// $match[2] = link address
// $match[3] = link text}
}
?>


This will extract both the link and the anchor text.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复