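<?php
$dom = new DOMDocument();
// loadHTML() emits warnings on malformed markup, which is common in real pages.
@$dom->loadHTML($html);
// DOMNodeList of <a> elements; read each URL with $node->getAttribute('href').
$urls = $dom->getElementsByTagName('a');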
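To show where the DOM approach goes from there, here is a minimal sketch that walks the anchor nodes and collects each URL with its link text. The sample `$html` string and the `$links` array are assumptions made up for illustration, not part of the original snippet.

```php
<?php
// Hypothetical sample input; any HTML string works here.
$html = '<p><a href="http://example1.com">Test 1</a> and '
      . '<a class="foo" href="http://example2.com">Test <b>2</b></a></p>';

$dom = new DOMDocument();
// Suppress the warnings libxml raises on imperfect markup.
@$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'), // attribute value as written
        'text' => $a->textContent,          // text with nested tags stripped
    ];
}

print_r($links);
```

Because the parser handles attribute order, quoting style, and nested tags, nothing here depends on how the `<a>` tag happens to be written.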
/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
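A rough sketch of how that pattern could be plugged into `preg_match_all()`. The sample `$html` and the `$links` array are assumptions for illustration; group 1 is the URL and group 2 is the link text.

```php
<?php
// Hypothetical sample input, including a link whose text spans two lines.
$html = "<a href=\"http://example1.com\">Test 1</a>\n"
      . "<a class='foo' href='http://example2.com'>Test\n2</a>";

// Same pattern as above; single quotes inside it are escaped for the PHP string.
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[] = ['href' => $m[1], 'text' => $m[2]];
    }
}

print_r($links);
```

The `s` modifier lets `.` match newlines, which is why the second link is still picked up, and `i` makes the tag and attribute names case-insensitive.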
Take a look at lookahead and lookbehind assertions.
<?php
$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';
if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches))
{
/*** if at least one link is found ***/
echo 'Found a match';
print_r($matches);
}
else
{
/*** if no match is found ***/
echo 'No match found';
}
?>
As far as using RegEx to extract links from HTML goes, this one is pretty damn robust:
\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))
Here's one that extracts all 'plain' text (i.e. content outside tags) from HTML documents:
(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)
Test them both here: http://www.martinwardener.com/regex
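For PHP users, here is a hedged sketch of the plain-text pattern run through `preg_match_all()`. PCRE accepts the `(?<name>...)` and `\k<name>` syntax used in it; the sample `$html`, the `~` delimiter, and the `$texts` variable are assumptions for illustration. Tags, comments, and script/style blocks are consumed by the first three alternatives, so only the `text` group carries content.

```php
<?php
// Hypothetical sample input with a style block, nested tags, and a comment.
$html = '<style>p{}</style><p>Hello <b>world</b></p><!-- c -->';

// "~" is used as the delimiter because the pattern itself contains "/".
$pattern = '~(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)~';

preg_match_all($pattern, $html, $m);

// The text group is empty whenever another alternative matched, so drop blanks.
$texts = array_values(array_filter($m['text'], function ($t) {
    return trim($t) !== '';
}));

print_r($texts);
```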
Try something like this:
//not tested
$regex_pattern = "/<a href=\"(.*?)\">(.*?)<\/a>/";
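Quantifier greediness matters a lot for a pattern of this shape. The sketch below compares a greedy `(.*)` with a lazy `(.*?)` on a line that contains two links; the sample `$html` and the variable names are made up for illustration.

```php
<?php
// Hypothetical input: two links on the same line.
$html = '<a href="a.html">A</a> <a href="b.html">B</a>';

$greedy = "/<a href=\"(.*)\">(.*)<\/a>/";
$lazy   = "/<a href=\"(.*?)\">(.*?)<\/a>/";

preg_match_all($greedy, $html, $g, PREG_SET_ORDER);
preg_match_all($lazy, $html, $l, PREG_SET_ORDER);

// The greedy version runs from the first href to the last closing tag.
echo count($g), " greedy match(es)\n";
echo count($l), " lazy match(es)\n";
```

With the greedy quantifier both links collapse into a single match whose first group contains markup from both tags; the lazy version stops at the nearest closing quote and `</a>`. Either way, the pattern only matches when `href` is the first and only attribute.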
<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER))
{
    foreach ($matches as $match)
    {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>
This will extract both the link and the anchor text.