Matching two words with some characters in between in regular expression

孤独总比滥情好 2021-01-07 14:55

I want to match a string that ends with .com and in which no abc is followed by some characters (possibly none) before that ending.

I tried with the following

4 Answers
  • 2021-01-07 15:42

    This looks like an XY Problem.

    DVK's answer shows you how you can tackle this problem using regular expressions, like you asked for.

    My solution (in Python) demonstrates that regular expressions are not necessarily the best approach and that tackling the problem using your programming language's string-handling functionality may produce a more efficient and more maintainable solution.

    #!/usr/bin/env python
    
    import unittest
    
    def is_valid_domain(domain):
        return domain.endswith('.com') and 'abc' not in domain
    
    class TestIsValidDomain(unittest.TestCase):
    
        def test_edu_invalid(self):
            self.assertFalse(is_valid_domain('def.edu'))
    
        def test_abc_invalid(self):
            self.assertFalse(is_valid_domain('abc.com'))
            self.assertFalse(is_valid_domain('abce.com'))
            self.assertFalse(is_valid_domain('abcAnYTHing.com'))
    
        def test_dotcom_valid(self):
            self.assertTrue(is_valid_domain('a.com'))
            self.assertTrue(is_valid_domain('b.com'))
            self.assertTrue(is_valid_domain('ab.com'))
            self.assertTrue(is_valid_domain('ae.com'))
    
    if __name__ == '__main__':
        unittest.main()
    


    Update

    Even in a language like Perl, where regular expressions are idiomatic, there's no reason to squash all of your logic into a single regex. A function like this would be far easier to maintain:

    sub is_domain_valid {
        my $domain = shift;
        return $domain =~ /\.com$/ && $domain !~ /abc/;
    }
    

    (I'm not a Perl programmer, but this runs and gives the results that you desire)

  • 2021-01-07 15:53

    Condensing:

    Sorry if I did not make myself clear. Here are some examples.
    I want def.edu, abc.com, abce.com, eabc.com and
    abcAnYTHing.com not to match,
    while a.com, b.com, ab.com, ae.com etc. should match.

    New regex after revised OP examples:
    /^(?:(?!abc.*\.com$|^def\.edu$).)+\.(?:com|edu)$/s

    use strict;
    use warnings;
    
    
    my @samples = qw/
     <newline>
       shouldn't_pass 
       def.edu  abc.com  abce.com eabc.com 
     <newline>
       should_pass.com
       a.com    b.com    ab.com   ae.com
       abc.edu  def.com  defa.edu
     /;
    
    my $regex = qr
      /
        ^    # Begin string
          (?:  # Group    
    
              (?!              # Lookahead ASSERTION
                    abc.*\.com$     # At any character position, cannot have these in front of us.
                  | ^def\.edu$      # (or 'def.*\.edu$')
               )               # End ASSERTION
    
               .               # This character passes
    
          )+   # End group, do 1 or more times
    
          \.   # End of string check,
          (?:com|edu)   # must be a '.com' or '.edu' (remove if not needed)
    
        $    # End string
      /sx;
    
    
    print "\nmatch using   /^(?:(?!abc.*\.com\$|^def\.edu\$).)+\.(?:com|edu)\$/s \n";
    
    for  my $str ( @samples )
    {
       if ( $str =~ /<newline>/ ) {
          print "\n"; next;
       }
    
       if ( $str =~ $regex ) {
           # print, not printf: a '%' in $str must not be read as a format directive
           print "passed - $str\n";
       }
       else {
           print "failed - $str\n";
       }
    }
    

    Output:

    match using /^(?:(?!abc.*.com$|^def.edu$).)+.(?:com|edu)$/s

    failed - shouldn't_pass
    failed - def.edu
    failed - abc.com
    failed - abce.com
    failed - eabc.com

    passed - should_pass.com
    passed - a.com
    passed - b.com
    passed - ab.com
    passed - ae.com
    passed - abc.edu
    passed - def.com
    passed - defa.edu
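
    For readers who do not use Perl, here is a rough Python port of the same pattern. It is only a sketch: the names PATTERN and passes are made up for illustration, re.VERBOSE stands in for Perl's /x, and re.DOTALL stands in for /s.

```python
import re

# Hypothetical port of the Perl pattern above; PATTERN and passes() are
# illustrative names, not part of the original answer.
PATTERN = re.compile(
    r"""
    ^
    (?:
        (?!                 # at this position, neither of these may lie ahead:
            abc.*\.com$     #   'abc', anything, then '.com' at the end
          | ^def\.edu$      #   the whole string being exactly 'def.edu'
        )
        .                   # otherwise accept one character
    )+
    \.(?:com|edu)           # required suffix
    $
    """,
    re.VERBOSE | re.DOTALL,
)


def passes(s: str) -> bool:
    """Return True if s is accepted by the tempered-lookahead pattern."""
    return PATTERN.search(s) is not None


if __name__ == "__main__":
    for sample in ["def.edu", "abc.com", "abce.com", "eabc.com",
                   "a.com", "ab.com", "abc.edu", "def.com", "defa.edu"]:
        print("passed" if passes(sample) else "failed", "-", sample)
```

    The lookahead is re-checked before every character the group consumes, which is why eabc.com is rejected even though it does not start with abc.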

  • 2021-01-07 15:54

    It's unclear from your wording if you want to match a string ending with .com AND NOT containing abc before that; or to match a string that doesn't have "abc followed by characters followed by .com".

    That is, in the first case "def.edu" does NOT match (it has no "abc" but doesn't end with ".com"), whereas in the second case "def.edu" matches (because it's not "abcSOMETHING.com").


    In the first case, you need to use negative look-behind:

    (?<!abc.+)\.com$
    # Use .* instead of .+ if you want "abc.com" to fail as well
    

    IMPORTANT: your original expression using look-behind, #3 ( (?<!abc).*\.com ), didn't work because a look-behind ONLY checks the text immediately preceding the position where it is applied. Therefore, the "something after abc" must be included in the look-behind together with abc, as my RegEx above does.

    PROBLEM: my RegEx above likely won't work with your specific RegEx engine unless it supports variable-length look-behind (like the one above). .NET does, but many other engines only accept fixed-width or bounded look-behind. (A good summary of which engines support which flavors of look-behind is at http://www.regular-expressions.info/lookaround.html ).
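
    As one concrete illustration, Python's built-in re module refuses to compile a variable-length look-behind. The helper name supports_pattern below is made up for this sketch; it simply reports whether re.compile accepts a pattern.

```python
import re


def supports_pattern(pattern: str) -> bool:
    """Return True if Python's built-in re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re raises re.error for a look-behind that is not fixed-width.
        return False
    return True


if __name__ == "__main__":
    # Variable-length look-behind, as in the expression above.
    print(supports_pattern(r"(?<!abc.+)\.com$"))
    # Fixed-width look-behind, which re does accept.
    print(supports_pattern(r"(?<!abc)\.com$"))
```

    A quick check like this is a cheap way to find out whether the double-match fallback is needed in a given environment.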

    If that is indeed the case, you will have to do a double match: first, check for .com, capturing everything before it; then do a negative match on abc. I will use Perl syntax since you didn't specify a language:

    if (/^(.*)\.com$/) {
        if ($1 !~ /abc/) { 
        # Or, you can just use a substring:
        # if (index($1, "abc") < 0) {
            # PROFIT!
        }
    }
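
    The same two-step check might look like this in Python. This is a sketch under assumptions: the function name ends_with_com_without_abc is invented here, and re.fullmatch is slightly stricter than Perl's $ because it does not tolerate a trailing newline.

```python
import re


def ends_with_com_without_abc(s: str) -> bool:
    """Two-step check: '.com' suffix first, then no 'abc' before it."""
    # Step 1: require the '.com' suffix and capture everything before it.
    m = re.fullmatch(r"(.*)\.com", s, flags=re.DOTALL)
    if m is None:
        return False
    # Step 2: reject the string if 'abc' occurs anywhere in the captured prefix.
    return "abc" not in m.group(1)


if __name__ == "__main__":
    for sample in ["a.com", "ab.com", "abc.com", "eabc.com", "def.edu"]:
        print(sample, ends_with_com_without_abc(sample))
```

    Splitting the test in two keeps each step readable and avoids look-behind entirely, so it works in any engine.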
    

    In the second case, the EASIEST thing to do is to use a "does not match" operator, e.g. !~ in Perl (or negate the result of a match if your language doesn't have such an operator). Example using pseudo-code:

    if (NOT string.match(/abc.+\.com$/)) ...
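
    In Python, which has no "does not match" operator, the negation could be written as below. The names ABC_THEN_COM and second_case_matches are hypothetical and exist only for this sketch.

```python
import re

# 'abc', at least one more character, then '.com' at the end of the string.
ABC_THEN_COM = re.compile(r"abc.+\.com$")


def second_case_matches(s: str) -> bool:
    """True when s is NOT of the form '...abc<something>.com'."""
    return ABC_THEN_COM.search(s) is None


if __name__ == "__main__":
    for sample in ["def.edu", "a.com", "abc.com", "abcSOMETHING.com"]:
        print(sample, second_case_matches(sample))
```

    Under this reading def.edu is accepted, and so is abc.com, because .+ requires at least one character between abc and .com; switch to .* to reject abc.com as well.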
    

    Please note that you don't need ".+"/".*" when using negative lookbehind.

  • 2021-01-07 15:55

    Do you just want to exclude strings that start with abc? That is, would xyzabc.com be okay? If so, this regex should work:

    ^(?!abc).+\.com$
    

    If you want to make sure abc doesn't appear anywhere, use this:

    ^(?:(?!abc).)+\.com$
    

    In the first regex, the lookahead is applied only once, at the beginning of the string. In the second regex the lookahead is applied each time the . is about to match a character, ensuring that the character is not the beginning of an abc sequence.
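
    The difference between the two patterns can be seen by running both against the same inputs. Here is a small Python sketch; the constant and function names are invented for illustration.

```python
import re

# Lookahead checked once, at the start of the string.
STARTS_WITHOUT_ABC = re.compile(r"^(?!abc).+\.com$")
# Lookahead checked before every character the group consumes.
NO_ABC_ANYWHERE = re.compile(r"^(?:(?!abc).)+\.com$")


def starts_without_abc(s: str) -> bool:
    return STARTS_WITHOUT_ABC.search(s) is not None


def no_abc_anywhere(s: str) -> bool:
    return NO_ABC_ANYWHERE.search(s) is not None


if __name__ == "__main__":
    for sample in ["a.com", "abc.com", "xyzabc.com"]:
        print(sample, starts_without_abc(sample), no_abc_anywhere(sample))
```

    Only xyzabc.com separates the two: the first pattern accepts it because abc is not at the start, while the second rejects it.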
