I am trying to write a regex rule to find all a href HTML links on my webpage and add rel="nofollow" to them.
However, I have a list of URLs that must be exc
An improvement to James' regex:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
This regex matches links that are NOT in the string array $follow_list. The strings don't need a leading 'www'. :)
The advantage is that this regex preserves other attributes in the tag (like target, style, title...). If a rel attribute already exists in the tag, the regex will NOT match, so you can still force a follow on URLs not in $follow_list by giving the link its own rel attribute.
Replace the match with:
$1$2$3"$4 rel="nofollow">
Full example (PHP):
function dont_follow_links( $html ) {
    // follow these websites only!
    $follow_list = array(
        'google.com',
        'mypage.com',
        'otherpage.com',
    );
    return preg_replace(
        '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',
        '$1$2$3"$4 rel="nofollow">',
        $html);
}
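One detail worth a sketch: the domain strings are spliced into the pattern as-is, so the dot in google.com matches any character. The variant below is only an illustration, not part of the answer above. The function name dont_follow_links_quoted, the whitelist parameter and the sample URLs are all made up; the only change to the regex is that each entry is passed through preg_quote() first.

```php
<?php
// Hypothetical variant of dont_follow_links(): the whitelist is a parameter
// and every entry is escaped so that "." only matches a literal dot.
function dont_follow_links_quoted($html, array $follow_list) {
    $quoted = array();
    foreach ($follow_list as $domain) {
        $quoted[] = preg_quote($domain, '%'); // '%' is the pattern delimiter
    }
    $pattern = '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'
        . implode('|(?:www\.)?', $quoted)
        . '))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%';
    return preg_replace($pattern, '$1$2$3"$4 rel="nofollow">', $html);
}

$follow = array('google.com', 'mypage.com'); // sample whitelist (assumption)

// Not whitelisted: gets rel="nofollow", other attributes are kept.
$external = dont_follow_links_quoted('<a href="http://example.org" title="x">E</a>', $follow);
// Whitelisted (with or without www): left untouched.
$allowed = dont_follow_links_quoted('<a href="http://www.google.com/search">G</a>', $follow);
// "googleXcom" would slip through an unescaped "google.com"; here it is tagged.
$lookalike = dont_follow_links_quoted('<a href="http://googleXcom.example">L</a>', $follow);

echo $external, "\n", $allowed, "\n", $lookalike, "\n";
```

Each sample is passed in as its own string on purpose: the (?!.*\brel=) lookahead scans to the end of the line, so a rel attribute on a later tag in the same line also blocks the match.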
If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:
$subject = preg_replace_callback('%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
    return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
}, $subject);
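To see the callback variant in action, here is a small self-contained sketch. The wrapper name force_nofollow, the sample whitelist and the sample links are assumptions for illustration; the two patterns are the ones from the snippet above.

```php
<?php
$follow_list = array('google.com', 'mypage.com'); // sample whitelist (assumption)

// Hypothetical wrapper around the preg_replace_callback snippet.
function force_nofollow($subject, array $follow_list) {
    $pattern = '%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'
        . implode('|(?:www\.)?', $follow_list)
        . '))[^"]+)"[^>]*)>%';
    return preg_replace_callback($pattern, function ($m) {
        // Strip any existing rel attribute from the opening tag first...
        $tag = preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]);
        // ...then append the one we want and close the tag again.
        return $tag . ' rel="nofollow">';
    }, $subject);
}

// External link with an existing rel: the old value is replaced.
$out1 = force_nofollow('<a href="http://example.org" rel="follow" title="x">E</a>', $follow_list);
// Whitelisted link: the outer pattern does not match, so nothing changes.
$out2 = force_nofollow('<a href="http://google.com" rel="follow">G</a>', $follow_list);

echo $out1, "\n", $out2, "\n";
```

When the old rel attribute is the last one in the tag, the result contains two spaces before rel="nofollow", which is harmless in HTML.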
(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"
would match the first part of any link that starts with http:// or https:// and doesn't contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that by
\1\2" rel="nofollow"
If a rel="nofollow" is already present, you'll end up with two of these. And of course, relative links or other protocols like ftp:// etc. won't be matched at all.
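A quick way to check the behaviour described above, including the two caveats, is a sketch like the following. The sample links are made up; the pattern and replacement are copied from this answer, with % chosen as the delimiter.

```php
<?php
$pattern = '%(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"%';
$replacement = '\1\2" rel="nofollow"';

// Ordinary external link: rel="nofollow" is added after the href.
$plain = preg_replace($pattern, $replacement, '<a href="http://google.com">G</a>');
// Excluded domain: the lookahead blocks the match, link stays as it is.
$excluded = preg_replace($pattern, $replacement, '<a href="http://pokerdiy.com/x">P</a>');
// Caveat 1: an existing rel="nofollow" gets duplicated.
$dup = preg_replace($pattern, $replacement, '<a href="http://google.com" rel="nofollow">G</a>');
// Caveat 2: other protocols are not matched at all.
$ftp = preg_replace($pattern, $replacement, '<a href="ftp://files.example.org">F</a>');

echo $plain, "\n", $excluded, "\n", $dup, "\n", $ftp, "\n";
```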
Explanation:
(?!\b(foo|bar)\b)[^"] matches any non-" character unless it is possible to match foo or bar at the current location. The \b word-boundary anchors are there to make sure we don't accidentally trigger on rebar or foonly.
This whole construct is repeated ((?: ... )+), and whatever is matched is preserved in backreference \2.
Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.
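The explanation can be checked in isolation with a stripped-down pattern. This is only a sketch: the quoted sample strings are invented, and the anchors ^ and $ are added so that each test looks at exactly one quoted value.

```php
<?php
// Every character inside the quotes must be a non-quote that does not
// start the whole word "foo" or "bar".
$pattern = '/^"((?:(?!\b(?:foo|bar)\b)[^"])+)"$/';

$samples = array('"a foo b"', '"bar"', '"rebar"', '"foonly"', '"baz"');
$results = array();
foreach ($samples as $s) {
    $results[$s] = (preg_match($pattern, $s) === 1);
}

// "a foo b" and "bar" are rejected; rebar, foonly and baz are accepted
// because the \b anchors require foo/bar to stand as whole words.
var_dump($results);
```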
I've developed a slightly more robust version that can detect whether the anchor tag already has "rel=" in it, therefore not duplicating attributes.
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>
Matches
<a href="http://google.com">Google</a>
<a title="Google" href="http://google.com">Google</a>
<a target="_blank" href="http://google.com">Google</a>
<a href="http://google.com" title="Google" target="_blank">Google</a>
But doesn't match
<a rel="nofollow" href="http://google.com">Google</a>
<a href="http://google.com" rel="nofollow">Google</a>
<a href="http://google.com" rel="nofollow" title="Google" target="_blank">Google</a>
<a href="http://google.com" title="Google" target="_blank" rel="nofollow">Google</a>
<a href="http://google.com" title="Google" rel="nofollow" target="_blank">Google</a>
<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>
Replace using
$1$2$3"$4 rel="nofollow">
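For anyone who wants to try it, here is a minimal PHP sketch that runs the pattern and replacement over a few of the sample links listed above. The helper name add_nofollow is made up for this example, and % is just one possible delimiter.

```php
<?php
// Hypothetical helper wrapping the pattern and replacement from this answer.
function add_nofollow($html) {
    $pattern = '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>%';
    return preg_replace($pattern, '$1$2$3"$4 rel="nofollow">', $html);
}

// From the "Matches" list: rel="nofollow" is appended, other attributes stay.
$a = add_nofollow('<a href="http://google.com">Google</a>');
$b = add_nofollow('<a target="_blank" href="http://google.com">Google</a>');
// From the "doesn't match" list: both come back unchanged.
$c = add_nofollow('<a href="http://google.com" rel="nofollow">Google</a>');
$d = add_nofollow('<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>');

echo $a, "\n", $b, "\n", $c, "\n", $d, "\n";
```

The links are fed in one at a time because the (?!.*\brel=) lookahead scans to the end of the line, not just to the end of the tag, so a rel= further along the same line would also prevent a match.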
Hope this helps someone!
James