RegEx expression to find a href links and add NoFollow to them

说谎 2020-12-11 22:41

I am trying to write a regex rule to find all a href HTML links on my webpage and add rel="nofollow" to them.

However, I have a list of URLs that must be exc

3 Answers
  • 2020-12-11 23:17

    An improvement to James' regex:

    (<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
    

    This regex matches links whose host is NOT in the string array $follow_list. The strings don't need a leading 'www'. :) The advantage is that this regex preserves the other attributes in the tag (like target, style, title...). If a rel attribute already exists in the tag, the regex will NOT match, so you can force a follow on a URL that is not in $follow_list by giving its tag its own rel attribute.

    Replace the match with:

    $1$2$3"$4 rel="nofollow">
    

    Full example (PHP):

    function dont_follow_links( $html ) {
     // follow these websites only!
     $follow_list = array(
      'google.com',
      'mypage.com',
      'otherpage.com',
     );
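     return preg_replace(
      // preg_quote() escapes regex metacharacters (such as the dots) in each domain
      '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', array_map('preg_quote', $follow_list)).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',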
      '$1$2$3"$4 rel="nofollow">',
      $html);
    }
    

    If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:

    $subject = preg_replace_callback('%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
        return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
    }, $subject);
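
    The snippet above assumes that $follow_list and $subject already exist. A minimal self-contained sketch of the same callback idea might look like the following. The function name force_nofollow_links and the sample domains are illustrative assumptions, not part of the original answer, and the domains are passed through preg_quote so that their dots match literally.

```php
<?php
// Hypothetical wrapper (name assumed for illustration): every external link
// whose host is not in $follow_list gets rel="nofollow", replacing any
// rel attribute the tag already has.
function force_nofollow_links(string $html, array $follow_list): string {
    // Escape regex metacharacters (such as dots) in each whitelisted domain.
    $domains = implode('|', array_map(function ($d) {
        return preg_quote($d, '%');
    }, $follow_list));

    // Capture the whole opening tag up to, but not including, the closing ">".
    $pattern = '%(<a\s[^>]*href="https?://(?!(?:www\.)?(?:' . $domains . '))[^"]+"[^>]*)>%i';

    return preg_replace_callback($pattern, function ($m) {
        // Drop an existing rel attribute, then append rel="nofollow".
        $tag = preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1%i', '', $m[1]);
        return $tag . ' rel="nofollow">';
    }, $html);
}

$follow_list = array('google.com', 'mypage.com');
$external = force_nofollow_links(
    '<a href="http://example.org/x" rel="external" target="_blank">x</a>',
    $follow_list
);
$whitelisted = force_nofollow_links('<a href="https://www.google.com/">g</a>', $follow_list);
```

    Here $external should come back with the old rel removed and rel="nofollow" appended, while $whitelisted is returned unchanged because its host is in the list.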
    
  • 2020-12-11 23:18
    (<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"
    

    would match the first part of any link that starts with http:// or https:// and doesn't contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that with

    \1\2" rel="nofollow"
    

    If a rel="nofollow" is already present, you'll end up with two of these. And of course, relative links or other protocols like ftp:// etc. won't be matched at all.

    Explanation:

    (?!\b(foo|bar)\b)[^"] matches any non-" character unless it is possible to match foo or bar at the current location. The \bs are there to make sure we don't accidentally trigger on rebar or foonly.

    This whole construct is repeated ((?: ... )+), and whatever is matched is preserved in backreference \2.

    Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.
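
    To see the pattern and its caveat in action, here is a small PHP sketch. The helper name add_nofollow_simple is an assumption made for illustration. Compared with the regex above, the dot in pokerdiy.com is escaped and the inner group is non-capturing, so only two backreferences are produced.

```php
<?php
// Hypothetical helper (name assumed): applies the pattern from this answer.
function add_nofollow_simple(string $html): string {
    // Each character of the href is consumed only if neither excluded URL
    // starts at that position.
    $pattern = '%(<a href="https?://)((?:(?!\b(?:pokerdiy\.com|www\.example\.com/link\.aspx)\b)[^"])+)"%';
    return preg_replace($pattern, '$1$2" rel="nofollow"', $html);
}

// An ordinary external link gets the attribute.
$plain = add_nofollow_simple('<a href="http://google.com/">G</a>');

// An excluded host is left alone.
$excluded = add_nofollow_simple('<a href="http://www.pokerdiy.com/forum">P</a>');

// The caveat from the answer: an existing rel="nofollow" is duplicated.
$doubled = add_nofollow_simple('<a href="http://google.com/" rel="nofollow">G</a>');
```

    The $doubled case shows why this version is only suitable for markup that is known not to carry a rel attribute yet.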

  • 2020-12-11 23:40

    I've developed a slightly more robust version that can detect whether the anchor tag already has "rel=" in it, therefore not duplicating attributes.

    (<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog\.bandit\.co\.nz)[^"]+)"([^>]*)>
    

    Matches

    <a href="http://google.com">Google</a>
    <a title="Google" href="http://google.com">Google</a>
    <a target="_blank" href="http://google.com">Google</a>
    <a href="http://google.com" title="Google" target="_blank">Google</a>
    

    But doesn't match

    <a rel="nofollow" href="http://google.com">Google</a>
    <a href="http://google.com" rel="nofollow">Google</a>
    <a href="http://google.com" rel="nofollow" title="Google" target="_blank">Google</a>
    <a href="http://google.com" title="Google" target="_blank" rel="nofollow">Google</a>
    <a href="http://google.com" title="Google" rel="nofollow" target="_blank">Google</a>
    <a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>
    

    Replace using

    $1$2$3"$4 rel="nofollow">
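
    As a quick check, the pattern and replacement can be run over a few of the sample tags with preg_replace. This is only a sketch: the % delimiters and variable names are assumptions, and the dots in the excluded host are escaped here. Each tag is processed as its own string because the (?!.*\brel=) lookahead scans to the end of the line, so a rel= in a later tag on the same line would also block the match.

```php
<?php
// Pattern from this answer, wrapped in % delimiters for preg_replace.
$pattern = '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog\.bandit\.co\.nz)[^"]+)"([^>]*)>%';
$replacement = '$1$2$3"$4 rel="nofollow">';

$samples = array(
    '<a href="http://google.com">Google</a>',                                // should match
    '<a href="http://google.com" title="Google" target="_blank">Google</a>', // should match
    '<a href="http://google.com" rel="nofollow">Google</a>',                 // already has rel
    '<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>',         // excluded host
);

// Apply the replacement to each tag separately.
$results = array();
foreach ($samples as $tag) {
    $results[] = preg_replace($pattern, $replacement, $tag);
}
```

    The first two entries of $results gain rel="nofollow" before the closing bracket; the last two come back unchanged.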
    

    Hope this helps someone!

    James
