Question

How can I write a regular expression to search for a string that contains "http://" AND does not contain "mysite.com"?

Answer 1:

WARNING

Attempting to rope regexes into boolean logic best accomplished in a proper programming language is a thankless job. While it is possible to write /PAT1/ and not /PAT2/ using complex lookaheads so that it is just one pattern, it is a painful task. You don’t want to do it this way!

You should have explained what you were really doing in the first place — some sort of match operation in a text editor. You didn’t. So you get a general answer that is going to be challenging to adapt to your localized situation.
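For comparison, here is a minimal sketch of the plain boolean-logic version that the warning has in mind, assuming you do have a programming language at hand. The sub name wanted and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# True when the string contains http:// and does not contain mysite.com
sub wanted {
    my ($s) = @_;
    return $s =~ m{http://} && $s !~ /mysite\.com/ ? 1 : 0;
}

# Hypothetical sample strings
for my $s ("go to http://example.org/", "go to http://mysite.com/", "plain text") {
    printf "%-28s %s\n", $s, wanted($s) ? "yes" : "no";
}
```

Two simple matches joined by && are easier to read and to change than one pattern that encodes both conditions.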

Quick Answer

(?sx)                 # let dot cross newlines, enable comments & whitespace
\A                    # anchor at the start so both lookaheads scan the whole string
(?= .* http://     )  # lookahead assertion for http://
(?! .* mysite\.com )  # lookahead negation  for mysite.com

Using Perl syntax, you could stick that (pre-)compiled pattern into a variable for future use this way:

my $is_valid_rx = qr{
    \A                    # anchor at the start of the string
    (?= .* http://     )  # lookahead assertion for http://
    (?! .* mysite\.com )  # lookahead negation  for mysite.com
}sx;                      # /s to cross newlines, /x for comments & whitespace

# then later on…
if ($some_string =~ $is_valid_rx) { 
     # your string has an http blah and lacks a mysite blah
}
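Here is a small sketch that exercises a two-lookahead pattern of this kind on a few strings. It anchors the pattern with \A, because an unanchored pattern can still match at a later position that lies past the unwanted text. The helper name is_valid and the sample strings are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# Anchored at the start, so both lookaheads scan the whole string
my $is_valid_rx = qr{
    \A
    (?= .* http://     )  # must contain http:// somewhere
    (?! .* mysite\.com )  # must not contain mysite.com anywhere
}sx;

# Hypothetical helper wrapping the match
sub is_valid {
    my ($s) = @_;
    return $s =~ $is_valid_rx ? 1 : 0;
}

# Hypothetical sample strings
my @samples = (
    "see http://example.org/page",
    "see http://mysite.com/page",
    "mysite.com mirrors http://example.org/",
    "no link here",
);

for my $s (@samples) {
    printf "%-42s %s\n", $s, is_valid($s) ? "accepted" : "rejected";
}
```

Only the first sample is accepted. The third one is the interesting case: mysite.com appears before the link, and the anchor is what keeps it rejected.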

However, if your goal is to pull out all such links, that isn’t going to help you, because those lookaheads do not tell you where in the string your link occurs.

In that case, it’s a lot easier to write something to pull out the links and then filter out your unwanted cases after that, using two separate regexes instead of trying to make one regex do everything.

my @all_links  = ($some_string =~ m{ https?://\S+ }xg);
my @good_links = grep { !/mysite\.com/ } @all_links;

Note that no attempt is made to match only links that contain valid URL characters, or to strip the accidental trailing punctuation that so often occurs in plain text.
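If trailing punctuation matters to you, one rough way to deal with it is to trim it after extraction. The following is only a sketch: the sub name good_links, the sample text, and the particular set of punctuation characters are assumptions, and the trimming will wrongly shorten a URL that legitimately ends in one of those characters.

```perl
use strict;
use warnings;

# Hypothetical sample text
my $some_string = 'Docs at http://example.org/docs, mirror at '
                . 'https://www.mysite.com/mirror. See (http://example.net/a)!';

sub good_links {
    my ($text) = @_;
    my @all_links = $text =~ m{ https?://\S+ }xg;
    # Drop punctuation that plain text tends to glue onto the end of a URL
    s{ [.,;:!?)\]>'"]+ \z }{}x for @all_links;
    return grep { !/mysite\.com/ } @all_links;
}

print "$_\n" for good_links($some_string);
```

On the sample text this prints http://example.org/docs and http://example.net/a, with the comma and the closing parenthesis plus exclamation mark removed.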

And now, for a real answer

Note also that if you’re parsing HTML with this, the approach outlined above is just a quick-and-dirty, fast-and-loose, shoot-from-the-hip kind of link extraction. It’s easy to construct valid input that turns up a lot of false positives, and not altogether hard to construct input that produces false negatives, too.

Here, in contrast, is a full program that dumps out all the <a ...> and <img ...> link addresses found in the pages named by its URL arguments, and actually does so correctly because it uses a real parser.

#!/usr/bin/env perl
#
# fetchlinks - fetch all <a> and <img> links from listed URL args
# Tom Christiansen <tchrist@perl.com>
# Wed Mar 14 08:03:53 MDT 2012
#
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

die "usage: $0 url ...\n" unless @ARGV;

for my $arg (@ARGV) {
    my @links = fetch_wanted_links($arg => qw<a img>);
    for my $link (@links) {
        print "$arg => " if @ARGV > 1;
        print "$link\n";
    }
}

exit;

sub fetch_wanted_links {
    my($url, @wanted) = @_;

    my %wanted;
    @wanted{@wanted} = (1) x @wanted;

    my $agent = LWP::UserAgent->new;

    # Set up a callback that collects links of the wanted variety
    my @hits = ();

    # Make the parser.  Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $parser = HTML::LinkExtor->new(sub {
       my($tag, %attr) = @_;
       return if %wanted and not $wanted{$tag};
       push @hits, values %attr;
    });

    # Request document and parse it as it arrives
    my $response = $agent->request(
           HTTP::Request->new(GET => $url),
           sub { $parser->parse( $_[0] ) },
    );

    # Expand all link URLs to absolute ones
    my $base = $response->base;
    @hits = map { url($_, $base)->abs } @hits;
    return @hits;
}

If you run it on a URL like this, it gives this accounting of all the anchor and image links:

$ perl fetchlinks http://www.perl.org/
http://www.perl.org/
http://st.pimg.net/perlweb/images/camel_head.v25e738a.png
http://www.perl.org/
http://www.perl.org/learn.html
http://www.perl.org/docs.html
http://www.perl.org/cpan.html
http://www.perl.org/community.html
http://www.perl.org/contribute.html
http://www.perl.org/about.html
http://www.perl.org/get.html
http://www.perl.org/get.html
http://www.perl.org/get.html
http://www.perl.org/about.html
http://www.perl.org/learn.html
http://st.pimg.net/perlweb/images/icons/learn.v0e1f83c.png
http://www.perl.org/learn.html
http://www.perl.org/community.html
http://st.pimg.net/perlweb/images/icons/community.v03bf8ce.png
http://www.perl.org/community.html
http://www.perl.org/docs.html
http://st.pimg.net/perlweb/images/icons/docs.v2622a01.png
http://www.perl.org/docs.html
http://www.perl.org/contribute.html
http://st.pimg.net/perlweb/images/icons/cog.v08b9acc.png
http://www.perl.org/contribute.html
http://www.perl.org/dev.html
http://www.perl.org/contribute.html
http://www.perl.org/cpan.html
http://st.pimg.net/perlweb/images/icons/cpan.vdc5be93.png
http://www.perl.org/cpan.html
http://www.perl.org/events.html
http://st.pimg.net/perlweb/images/icons/cal.v705acef.png
http://www.perl.org/events.html
http://www.perl6.org/
http://st.pimg.net/perlweb/images/icons/perl6.v8ff6c63.png
http://www.perl6.org/
http://www.perl.org/dev.html
http://www.perlfoundation.org/
http://st.pimg.net/perlweb/images/icons/onion.vee5cb98.png
http://www.perlfoundation.org/
http://www.cpan.org/
http://search.cpan.org/~jtang/Net-Stomp-0.45/
http://search.cpan.org/~vaxman/Array-APX-0.3/
http://search.cpan.org/~salva/Net-SFTP-Foreign-1.71/
http://search.cpan.org/~grandpa/Win32-MSI-HighLevel-1.0008/
http://search.cpan.org/~teejay/Catalyst-TraitFor-Component-ConfigPerSite-0.06/
http://search.cpan.org/~jwieland/WebService-Embedly-0.04/
http://search.cpan.org/~mariab/WWW-TMDB-API0.04/
http://search.cpan.org/~teejay/SOAP-Data-Builder-1/
http://search.cpan.org/~dylan/WWW-Google-Translate-0.03/
http://search.cpan.org/~jtbraun/Parse-RecDescent-1.967_008/
http://www.perl.org/get.html
http://www.perl.org/learn.html
http://www.perl.org/docs.html
http://www.perl.org/community.html
http://www.perl.org/events.html
http://www.perl.org/siteinfo.html#sponsors
http://www.yellowbot.com/
http://st.pimg.net/perlweb/images/friends/yellowbot.vcc29f5b.gif
http://www.perl.org/
http://blogs.perl.org/
http://jobs.perl.org/
http://learn.perl.org/
http://dev.perl.org/
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
http://i.creativecommons.org/l/by-nc-nd/3.0/us/80x15.png
http://www.perl.org/siteinfo.html

For any work more serious than running a quick grep over a file to eyeball general results, you need to use a proper parser to do this sort of thing.
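To tie the parser approach back to the original question, here is a sketch that parses a page held in a string and keeps only the http and https links whose host is not the unwanted site. It assumes the HTML::LinkExtor and URI modules are installed; the sub name external_links, the sample HTML, and the base URL are made up for illustration.

```perl
use strict;
use warnings;

use HTML::LinkExtor;
use URI;

# Hypothetical page content and base URL
my $html = <<'HTML';
<a href="http://www.mysite.com/home">home</a>
<a href="/about.html">about</a>
<img src="http://cdn.example.org/logo.png">
<a href="http://example.net/x?ref=mysite.com">elsewhere</a>
HTML
my $base = 'http://www.mysite.com/';

sub external_links {
    my ($html, $base, $unwanted_host) = @_;
    my @hits;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' or $tag eq 'img';
            push @hits, values %attr;
        },
        $base,    # with a base, links come back as absolute URI objects
    );
    $parser->parse($html);
    $parser->eof;

    return grep {
        my $u = URI->new($_);
        # Keep only http(s) links whose host is not the unwanted one
        $u->scheme && $u->scheme =~ /\Ahttps?\z/
            && $u->host !~ /(?:\A|\.)\Q$unwanted_host\E\z/i
    } @hits;
}

print "$_\n" for external_links($html, $base, 'mysite.com');
```

Comparing hosts rather than grepping the whole URL means the relative /about.html link is correctly treated as belonging to mysite.com, while the example.net link survives even though mysite.com appears in its query string.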