I need a regexp to strip out just the domain name part of a URL. So, for example, if I had the following URL:
http://www.website-2000.com
the bit I'd want the
This one should work. There might be some faults with it, but none that I can think of right now. If anyone wants to improve on it, feel free to do so.
/http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*/i
http:\/\/ matches the "http://" part
(?:www\.)? is a non-capturing group that matches zero or one "www."
([a-z0-9\-]+) is a capturing group that matches character ranges a-z, 0-9
in addition to the hyphen. This is what you wanted to extract.
(?:\.[a-z\.]+[\/]?) is a non-capturing group that matches the TLD part (i.e. ".com",
".co.uk", etc) in addition to zero or one "/"
.* matches the rest of the url
http://rubular.com/r/ROz13NSWBQ
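For anyone who wants to try the pattern outside Rubular, here is a minimal sketch in Python. The names DOMAIN_RE and extract_name are made up for illustration; the pattern itself is the one from the answer, with the /.../i delimiters replaced by re.IGNORECASE.

```python
import re

# The answer's pattern in Python syntax: the /.../ delimiters are dropped
# and the trailing i flag becomes re.IGNORECASE.
DOMAIN_RE = re.compile(
    r"http://(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[/]?).*",
    re.IGNORECASE,
)


def extract_name(url):
    """Return the first capture group, or None when the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


print(extract_name("http://www.website-2000.com"))      # website-2000
print(extract_name("http://website-2000.co.uk/about"))  # website-2000
```

Two things to keep in mind: the pattern only accepts http://, so an https:// URL gives None, and for a host such as blog.example.com the capture group holds the first label ("blog"), not "example".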
http://www\.([^/]+)
No need for a regexp; use the urlparse module.
>>> from urlparse import urlparse
>>> '.'.join(urlparse("http://www.website-2000.com").netloc.split('.')[-2:])
'website-2000.com'
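The session above is Python 2. In Python 3 the same function lives in urllib.parse, so a rough equivalent could look like the sketch below. The helper name last_two_labels is hypothetical, and it reads .hostname instead of .netloc (my substitution, so that a port or user:password@ prefix does not end up in the result).

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Join the last two dot-separated labels of the URL's host name."""
    # .hostname is lower-cased and excludes any port or credentials;
    # it is None when the URL has no network location.
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(last_two_labels("http://www.website-2000.com"))     # website-2000.com
print(last_two_labels("http://www.example.co.uk/page"))   # co.uk
```

As the second call shows, taking the last two labels is only a heuristic: it breaks on multi-part suffixes such as .co.uk.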
With this one you don't have to worry about the http/https/ftp etc. in front, and it also captures all your subdomains.
/(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*/i
The only times it fails that I've found are:
- If a . precedes the domain/subdomain without any text before it, the . is included in the regex capture.
- Emails with . in them will not work. (Fix this by checking the passed domain first for the @ symbol before running it through the regex.)
- Whitespace in the middle of the domain/subdomain.
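Here is a hedged Python sketch of this scheme-agnostic pattern, including the suggested @ check. HOST_RE and capture_host are illustrative names, and returning None for anything containing @ is just one way to read the suggested fix. Because the pattern has no anchor, the sketch uses re.search so that the match can start after the scheme.

```python
import re

# Unanchored pattern from the answer; the i flag becomes re.IGNORECASE.
HOST_RE = re.compile(
    r"(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[/]?).*",
    re.IGNORECASE,
)


def capture_host(text):
    """Return the captured subdomain(s) plus name, or None if rejected."""
    # Assumption: anything containing @ is treated as an email and skipped.
    if "@" in text:
        return None
    match = HOST_RE.search(text)
    return match.group(1) if match else None


print(capture_host("https://www.blog.website-2000.com/path"))  # blog.website-2000
print(capture_host("ftp://files.example.org"))                 # files.example
print(capture_host(".example.com"))                            # .example
```

The last call reproduces the first failure case from the list: a leading . ends up inside the capture.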
r/^[^:]+:\/\/[^/?#]+//
This worked for me.
It matches any scheme or protocol, and then after the :// it matches any character that is not a /, ? or #. These three characters, when they first occur in a URL, signal the end of the domain, so that's where I end the match.
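A small Python sketch of the same idea follows. The capturing group around the host part is my addition (the pattern as posted has none), and ORIGIN_RE and split_origin are hypothetical names.

```python
import re

# Scheme, then "://", then everything up to the first "/", "?" or "#".
# The parentheses around the host part are added here for convenience.
ORIGIN_RE = re.compile(r"^[^:]+://([^/?#]+)")


def split_origin(url):
    """Return (scheme-plus-host, host), or None when there is no '://'."""
    match = ORIGIN_RE.match(url)
    if match is None:
        return None
    return match.group(0), match.group(1)


print(split_origin("https://www.website-2000.com/path?q=1#top"))
# ('https://www.website-2000.com', 'www.website-2000.com')
print(split_origin("www.website-2000.com"))
# None
```

Note that the host part is taken as-is, so a port (example.com:8080) or credentials (user@example.com) would remain in it.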
Let me introduce you to this wonderful tool, txt2re: regular expression generator.
Here you can experiment with regex and generate code in many languages.