Removing HTTP and WWW from a URL in Python

半阙折子戏 2021-01-11 23:22
url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtub…


        
3 answers
  • 2021-01-11 23:48

    Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?

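    import re
    new_url = re.sub(r'.*?w\.', '', url, count=1)
    

    The lazy .*? together with count=1 stops after the first match, so a host name that itself ends in "w." is not harmed. This assumes the www. prefix is actually present.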

    edit after clarification

    I'd do two steps:

    import re

    # Strip the scheme first, then a leading "www." (both anchored to the start).
    if url.startswith('http'):
        url = re.sub(r'^https?://', '', url)
    if url.startswith('www.'):
        url = re.sub(r'^www\.', '', url)
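
The two-step idea can also be written without regex. The following is a minimal sketch using str.removeprefix (Python 3.9+), not the answer's own code; the helper name strip_scheme_and_www is hypothetical.

```python
def strip_scheme_and_www(url):
    """Remove a leading http:// or https://, then a leading www. (hypothetical helper)."""
    # Step 1: drop the scheme if the URL starts with one.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url.removeprefix(scheme)
            break
    # Step 2: drop "www." only when it is at the start of what remains.
    return url.removeprefix("www.")


# Sample URLs taken from the question.
for sample in ("www.google.com", "http://www.google.com", "http://google.com",
               "www.google", "http://www.google.com/images"):
    print(sample, "->", strip_scheme_and_www(sample))
```

Because both steps only ever touch the start of the string, a "www." or "http://" that appears later in the path or query is left alone.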
    
  • 2021-01-11 23:54

    A more elegant solution would be using urlparse:

    from urllib.parse import urlparse
    
    def get_hostname(url, uri_type='both'):
        """Get the host name from the url"""
        parsed_uri = urlparse(url)
        if uri_type == 'both':
            return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        elif uri_type == 'netloc_only':
            return '{uri.netloc}'.format(uri=parsed_uri)
    

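    With the default uri_type='both', the function returns the scheme (http or https, whichever the link uses) followed by the host. With uri_type='netloc_only' it returns just the netloc, which is the part you were looking for (a leading www. is still included).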
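
One caveat with urlparse is that it only fills in netloc when the input has a scheme or starts with "//", so a bare "www.google.com" from the question would come back with an empty netloc. A minimal sketch that works around this and also drops the leading "www." might look like the following; the helper name bare_host is hypothetical and not part of the answer above.

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return the host without scheme or leading 'www.' (hypothetical helper)."""
    # Without a scheme, urlparse treats the whole string as a path,
    # so prepend "//" to make it parse as a network location.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).netloc
    # Drop a leading "www." if present.
    return host[4:] if host.startswith("www.") else host


print(bare_host("www.google.com"))                # google.com
print(bare_host("http://www.google.com/images"))  # google.com
```

Unlike the regex answers, this discards the path (/images), so it fits only if the host name alone is wanted.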

  • 2021-01-12 00:02

    You can use str.replace:

    url = 'http://www.google.com/images'
    url = url.replace("http://www.", "")
    print(url)  # google.com/images
    

    Or you can use a regular expression:

    import re
    # Matches http:// or https://, optionally followed by www.
    pattern = re.compile(r"https?://(www\.)?")
    print(pattern.sub('', 'http://www.google.com/images').strip().strip('/'))
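
The pattern in this answer requires a scheme, so it leaves "www.google.com" untouched, and it is not anchored, so it would also strip a second URL embedded in a query string. As a sketch under those observations, a variant that makes the scheme optional and anchors the match at the start could look like this; the name PREFIX and the use of re.IGNORECASE are assumptions added for illustration.

```python
import re

# Anchored with ^ so only a prefix is removed; both parts are optional.
# re.IGNORECASE also accepts "HTTP://WWW." (an assumption about the input).
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)

# Sample URLs taken from the question.
samples = [
    "www.google.com",
    "http://www.google.com",
    "http://google.com",
    "www.google",
    "http://www.google.com/images",
]

# Remove the prefix, then any trailing slash.
cleaned = [PREFIX.sub("", s).rstrip("/") for s in samples]
print(cleaned)
```

For the five sample URLs this prints ['google.com', 'google.com', 'google.com', 'google', 'google.com/images'].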
    