Regexp for extracting all links and anchor texts from HTML

说谎 2020-12-15 14:31

I'd like one or more regexes that can:

1) Take the html of a large page.

2) Find the urls contained in all links, for example:



        
6 Answers
  • 2020-12-15 14:47
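    <?php

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    // Each <a> element carries the URL (href) and the anchor text.
    foreach ($dom->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href'), ' => ', $a->textContent, "\n";
    }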
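
    If the page is large and not perfectly valid, the parser tends to emit a lot of warnings. A minimal sketch of how the DOM approach could be wrapped, assuming the `dom` extension is available; the function name `extract_links_dom` and the sample markup are made up for illustration:

    ```php
    <?php
    // Hypothetical helper (not part of the answer): returns url/text pairs.
    function extract_links_dom(string $html): array
    {
        $dom = new DOMDocument();
        // Collect parser warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $links = [];
        foreach ($dom->getElementsByTagName('a') as $a) {
            if (!$a->hasAttribute('href')) {
                continue; // skip anchors without a URL, e.g. <a name="top">
            }
            $links[] = [
                'url'  => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
        return $links;
    }

    // Made-up sample input.
    $sample = '<p><a href="http://example1.com">Test 1</a> '
            . '<a class="foo" href="http://example2.com"><b>Test</b> 2</a></p>';
    print_r(extract_links_dom($sample));
    ```

    Because `textContent` flattens nested tags, the second link's text comes back as plain `Test 2`.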
    
  • 2020-12-15 14:56
    /<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
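
    For anyone unsure how to apply the pattern, here is a rough sketch using `preg_match_all` with `PREG_SET_ORDER`. The pattern is copied from above; the helper name `extract_links_regex` and the use of `strip_tags` to clean the anchor text are my own assumptions, not part of the answer:

    ```php
    <?php
    // Hypothetical helper wrapping the pattern above.
    function extract_links_regex(string $html): array
    {
        $pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';
        $links = [];
        if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                $links[] = [
                    'url'  => $m[1],                      // first group: href value
                    'text' => trim(strip_tags($m[2])),    // second group: anchor text without nested tags
                ];
            }
        }
        return $links;
    }
    ```

    Note that the pattern only handles quoted `href` values, and `<a[^>]+` will also match other tags whose name starts with `a` if they carry an `href`.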
    
  • 2020-12-15 14:59

    Take a look at lookahead and lookbehind assertions. The example below uses a lookahead to capture the href value wherever it sits in the tag.

    <?php
    
    $string = '<a href="http://example1.com">Test 1</a>
    <a class="foo" id="bar" href="http://example2.com">Test 2</a>
    <a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';
    
    // The lookahead (?=href="...") captures the URL without consuming the other attributes.
    if (preg_match_all('|<a\s[^>]*?(?=href="([^"]*)")[^>]*>([^<]*)</a>|i', $string, $matches)) {
        /*** $matches[1] holds the URLs, $matches[2] the anchor texts ***/
        echo 'Found a match';
        print_r($matches);
    } else {
        /*** if no match is found ***/
        echo 'No match found';
    }
    ?>
    
  • 2020-12-15 15:02

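    As far as using RegEx to extract links from HTML goes, this one is pretty damn robust (note that it reuses the group name `url`, which .NET allows but PCRE rejects unless the `(?J)` option is set):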

    \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))

    Here's one that extracts all 'plain' text (i.e. content outside tags) from HTML documents:

    (<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)

    Test them both here: http://www.martinwardener.com/regex

  • 2020-12-15 15:03

    Try something like this:

    // Not tested; assumes href is the first attribute of the tag.
    $regex_pattern = "/<a href=\"(.*?)\">(.*?)<\/a>/";
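
    Whether the quantifiers are greedy or lazy matters as soon as a line holds more than one link. A small sketch comparing the two forms; the sample string and variable names are made up for illustration:

    ```php
    <?php
    // Two links on the same line (made-up sample).
    $html = '<a href="http://example1.com">Test 1</a> <a href="http://example2.com">Test 2</a>';

    $greedy = '/<a href="(.*)">(.*)<\/a>/';    // (.*) runs to the last "> on the line
    $lazy   = '/<a href="(.*?)">(.*?)<\/a>/';  // (.*?) stops at the first "> and first </a>

    preg_match_all($greedy, $html, $greedyMatches, PREG_SET_ORDER);
    preg_match_all($lazy, $html, $lazyMatches, PREG_SET_ORDER);

    echo count($greedyMatches), "\n"; // 1: both links are swallowed into one match
    echo count($lazyMatches), "\n";   // 2: one match per link
    ```

    In the greedy case the first group ends up containing everything from the first URL through the second one, so only the lazy form gives usable URLs here.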
    
  • 2020-12-15 15:04
    <?php
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // $match[2] = link address
            // $match[3] = link text
        }
    }
    ?>
    

    This will extract both the link and the anchor text.
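
    One thing to watch: the pattern only treats double quotes as delimiters, so a single-quoted `href` is captured with its quotes attached. A possible wrapper that collects the results and trims those quotes; the function name `extract_links_ungreedy` and the cleanup steps are assumptions of mine, not part of the snippet above:

    ```php
    <?php
    // Hypothetical wrapper around the pattern from this answer.
    function extract_links_ungreedy(string $html): array
    {
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        $links = [];
        // The U modifier inverts greediness, so (.*) stops at the first </a>.
        if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $links[] = [
                    // Strip the quotes left over from single-quoted hrefs.
                    'url'  => trim($match[2], "'"),
                    // Drop nested tags such as <b> from the anchor text.
                    'text' => trim(strip_tags($match[3])),
                ];
            }
        }
        return $links;
    }
    ```

    Single-quoted URLs that contain spaces would still be cut short, since a space ends the second group.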
