url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtub
Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?
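import re
new_url = re.sub(r'^.*?w\.', '', url, count=1)
The non-greedy .*? together with count=1 removes only the text up to the first "w.", so it does not harm websites whose name ends with a w. Note that this assumes www. is always present.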
Edit after clarification:
I'd do two steps:
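if url.startswith('http'):
    # strip the scheme: "http://" or "https://"
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    # escape the dot and anchor the match so only a leading "www." is removed
    url = re.sub(r'^www\.', '', url)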
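The two steps can also be wrapped in a small function and tried on the URLs from the question. This is only a minimal sketch: the name strip_prefixes is made up for illustration, and it uses plain string slicing in place of re.sub.

```python
def strip_prefixes(url):
    """Remove a leading http:// or https://, then a leading www."""
    # Step 1: drop the scheme, if present.
    for scheme in ('https://', 'http://'):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # Step 2: drop a leading "www." only, not a "www." elsewhere in the URL.
    if url.startswith('www.'):
        url = url[len('www.'):]
    return url


for sample in ('www.google.com', 'http://www.google.com',
               'http://google.com', 'www.google',
               'http://www.google.com/images'):
    print(sample, '->', strip_prefixes(sample))
```

Anything after the host (such as /images) is kept, so trim it separately if you only want the domain.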
A more elegant solution would be using urlparse:
from urllib.parse import urlparse

def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)
The first option includes https or http, depending on the link, and the second part, netloc, includes what you were looking for.
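One caveat: urlparse only fills in netloc when the URL has a scheme or starts with "//", so an input like 'www.google.com' comes back with an empty netloc. A rough sketch of one way around that, assuming you only want the host without www. (the helper name bare_host is hypothetical):

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return the host part of url without a leading 'www.'."""
    # Without a scheme, urlparse treats the whole string as a path,
    # so prepend "//" to mark the start of the network location.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).netloc
    # netloc still contains "www.", so remove it if it leads the host.
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(bare_host('www.google.com'))
print(bare_host('http://www.google.com/images'))
```

Unlike the regex versions, this drops the path, so 'http://www.google.com/images' is reduced to the host alone.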
You can use str.replace:
url = 'http://www.google.com/images'
url = url.replace("http://www.", "")
print(url)
Or you can use regular expressions:
import re
# optional "s" in the scheme, optional "www." after it
pattern = re.compile(r"https?://(www\.)?")
print(pattern.sub('', 'http://www.google.com/images').strip().strip('/'))
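The pattern above only matches when a scheme is present, so 'www.google.com' would pass through unchanged. A possible variation, offered as a sketch rather than the definitive answer, makes the scheme optional as well and anchors the match at the start of the string (the names PREFIX and clean are just for illustration):

```python
import re

# Both the scheme and "www." are optional; "^" keeps the match at the start.
PREFIX = re.compile(r'^(https?://)?(www\.)?')


def clean(url):
    """Strip a leading scheme and/or 'www.', plus any trailing slash."""
    return PREFIX.sub('', url.strip()).rstrip('/')


urls = ['www.google.com', 'http://www.google.com', 'http://google.com',
        'www.google', 'http://www.google.com/images']
for u in urls:
    print(u, '->', clean(u))
```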