regex needed to strip out domain name

终归单人心 2020-12-20 01:09

I need a regex that extracts just the domain name part of a URL. For example, given the following URL:

http://www.website-2000.com

the bit I'd want the

5 Answers
  • This one should work. It may have flaws, but none that I can think of right now. If anyone wants to improve on it, feel free to do so.

    /http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*/i
    
    http:\/\/            matches the "http://" part
    (?:www\.)?           is a non-capturing group that matches zero or one "www."
    ([a-z0-9\-]+)        is a capturing group that matches character ranges a-z, 0-9
                         in addition to the hyphen. This is what you wanted to extract.
    (?:\.[a-z\.]+[\/]?)  is a non-capturing group that matches the TLD part (i.e. ".com",
                         ".co.uk", etc) in addition to zero or one "/"
    .*                   matches the rest of the url
    

    http://rubular.com/r/ROz13NSWBQ
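
For readers who want to try the pattern outside an online tester, here is a minimal Python sketch. It assumes the `/…/i` literal maps to `re.IGNORECASE` and drops the slash escaping, which Python does not need; the helper name `domain_label` is made up for illustration.

```python
import re

# The answer's pattern as a Python raw string.
# re.IGNORECASE stands in for the trailing /i flag (assumption: same semantics).
DOMAIN_RE = re.compile(
    r"http://(?:www\.)?([a-z0-9\-]+)(?:\.[a-z.]+/?).*",
    re.IGNORECASE,
)


def domain_label(url):
    """Return the first captured label, or None if the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


print(domain_label("http://www.website-2000.com"))      # website-2000
print(domain_label("http://website-2000.co.uk/page"))   # website-2000
print(domain_label("https://www.website-2000.com"))     # None (only http:// is handled)
print(domain_label("http://blog.website-2000.com"))     # blog (a subdomain is captured instead)
```

The last two calls show where the pattern, as written, falls short: it is tied to `http://`, and any subdomain other than `www` ends up in the capture group.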

  • 2020-12-20 02:01
    http://www\.([^/]+)
    

    No regex is needed; use urlparse (the urlparse module on Python 2, urllib.parse on Python 3):

    >>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
    >>> '.'.join(urlparse("http://www.website-2000.com").netloc.split('.')[-2:])
    'website-2000.com'
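
The same idea can be wrapped in a small function. This is a sketch assuming Python 3; `last_two_labels` is a hypothetical name, and it reads `hostname` rather than `netloc` so that a port or user info does not end up in the result.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host.

    This is a heuristic: it is not aware of multi-part suffixes such as
    "co.uk", so it can return the suffix instead of the registered domain.
    """
    host = urlparse(url).hostname or ""   # hostname is None when there is no "//"
    return ".".join(host.split(".")[-2:])


print(last_two_labels("http://www.website-2000.com"))            # website-2000.com
print(last_two_labels("http://www.website-2000.com:8080/path"))  # website-2000.com
print(last_two_labels("http://www.website-2000.co.uk"))          # co.uk
print(repr(last_two_labels("www.website-2000.com")))             # '' (no scheme, so no host is parsed)
```

The third call shows the limit of taking the last two labels: handling suffixes like `co.uk` correctly would need a list of public suffixes, which is outside what this answer covers.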
    

  • 2020-12-20 02:02

    With this one you don't have to worry about the http/https/ftp etc. in front, and it also captures all your subdomains.

    /(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*/i
    

    The only cases I've found where it fails are:
      - If a "." precedes the domain/subdomain with no text before it, the "." is included in the capture.
      - Email addresses containing a "." will not work (fix this by checking the input for an "@" symbol before running it through the regex).
      - Whitespace in the middle of the domain/subdomain.
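
The failure cases listed above can be guarded against before the regex runs. The following is a rough Python sketch, not part of the original answer: `extract_host_label` is a hypothetical helper, `re.IGNORECASE` is assumed to correspond to the `/i` flag, and stripping a leading "." from the capture is one possible way to handle the first case.

```python
import re

# The answer's pattern without the slash escaping that Python does not need.
HOST_RE = re.compile(r"(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z.]+/?).*", re.IGNORECASE)


def extract_host_label(text):
    """Return the captured subdomain/domain part, or None for rejected input."""
    candidate = text.strip()
    # Reject email addresses and inputs with inner whitespace up front,
    # since the pattern would otherwise return a misleading partial match.
    if "@" in candidate or any(ch.isspace() for ch in candidate):
        return None
    match = HOST_RE.search(candidate)
    if match is None:
        return None
    # Drop a leading "." that the character class lets into the capture.
    return match.group(1).lstrip(".")


print(extract_host_label("https://blog.website-2000.com/page"))  # blog.website-2000
print(extract_host_label("user.name@website-2000.com"))          # None
print(extract_host_label("web site-2000.com"))                   # None
print(extract_host_label(".website-2000.com"))                   # website-2000
```

Without the checks, the second input would match and yield `user`, and the third would yield `site-2000`, which is why rejecting them early seems safer than trusting the capture.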

  • 2020-12-20 02:02
    r/^[^:]+:\/\/[^/?#]+//
    

    This worked for me.

    It matches any scheme or protocol and then, after the ://, any run of characters other than /, ? or #. The first occurrence of any of these three characters in a URL signals the end of the domain, so that's where I end the match.
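
As an illustration of the scheme-agnostic approach, here is a small Python sketch. The capture group around the host part and the helper name `authority` are additions made for this example; the answer's pattern only matches the prefix and does not capture anything.

```python
import re

# Any scheme, then "://", then everything up to the first "/", "?" or "#".
# The parentheses are an addition so the host part can be read back.
AUTHORITY_RE = re.compile(r"^[^:]+://([^/?#]+)")


def authority(url):
    """Return the part between "://" and the first "/", "?" or "#", else None."""
    match = AUTHORITY_RE.match(url)
    return match.group(1) if match else None


print(authority("ftp://files.website-2000.com/pub?x=1"))  # files.website-2000.com
print(authority("https://www.website-2000.com#top"))      # www.website-2000.com
print(authority("http://user@website-2000.com:8080/"))    # user@website-2000.com:8080
print(authority("www.website-2000.com"))                  # None (no scheme present)
```

Note that the captured text is the whole authority component, so any user info or port in the URL comes along with the host name.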

  • 2020-12-20 02:12

    Let me introduce you to this wonderful tool, txt2re: a regular expression generator.

    Here you can experiment with regex and generate code in many languages.
