How can I prepend http to a URL if it doesn't begin with http?

陌清茗 2021-01-08 00:21

I have urls formatted as:

google.com
www.google.com
http://google.com
http://www.google.com

I would like to convert all types of links to a

4 answers
  • 2021-01-08 00:27

    I found it easy to detect the protocol with a regex and then prepend it if it is missing:

    import re
    def formaturl(url):
        if not re.match('(?:http|ftp|https)://', url):
            return 'http://{}'.format(url)
        return url
    
    url = 'test.com'
    print(formaturl(url)) # http://test.com
    
    url = 'https://test.com'
    print(formaturl(url)) # https://test.com
    

    I hope it helps!
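
    The pattern above is case-sensitive and lists its schemes explicitly. One possible variation, sketched here on the assumption that any `scheme://` prefix should be left untouched, matches a generic RFC 3986-style scheme and ignores case. The names `SCHEME_RE` and `ensure_scheme` are made up for this illustration.

```python
import re

# Assumption: a scheme is a letter followed by letters, digits, '+', '-' or '.',
# and counts as present only when it is followed by '://'.
SCHEME_RE = re.compile(r'[a-z][a-z0-9+.-]*://', re.IGNORECASE)


def ensure_scheme(url, default='http'):
    """Return url unchanged if it starts with some scheme://, else prepend default://."""
    if SCHEME_RE.match(url):
        return url
    return '{}://{}'.format(default, url)


for candidate in ('google.com', 'www.google.com', 'HTTP://google.com', 'ftp://example.org'):
    print(ensure_scheme(candidate))
```

    With this variant, an upper-case `HTTP://` prefix is recognised as a scheme instead of getting a second `http://` prepended.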

  • 2021-01-08 00:36
    from urllib.parse import urlsplit, urlunsplit, SplitResult

    def fix_url(orig_link):
        # force scheme: 'https' is used only when the link has none
        split_comps = urlsplit(orig_link, scheme='https')
        # fix netloc (without a scheme, the host is parsed as part of the path)
        if not split_comps.netloc:
            if split_comps.path:
                # override components with fixed netloc and path
                split_comps = SplitResult(scheme='https', netloc=split_comps.path, path='',
                                          query=split_comps.query, fragment=split_comps.fragment)
            else:  # no netloc, no path
                raise ValueError('URL has neither host nor path: {!r}'.format(orig_link))
        return urlunsplit(split_comps)
    
  • 2021-01-08 00:39

    Python does have built-in functions that handle this correctly, for example:

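    from urllib import parse as urlparse  # Python 3; on Python 2: import urlparse

    my_url = 'google.com'  # example input
    # 'http' is the default scheme, used only when my_url has none
    p = urlparse.urlparse(my_url, 'http')
    # without a scheme, the host is parsed into .path instead of .netloc
    netloc = p.netloc or p.path
    path = p.path if p.netloc else ''
    if not netloc.startswith('www.'):
        netloc = 'www.' + netloc

    p = urlparse.ParseResult('http', netloc, path, *p[3:])
    print(p.geturl())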
    

    If you want to remove (or add) the www part, you have to edit the .netloc field of the resulting object before calling .geturl().

    Because ParseResult is a namedtuple, you cannot edit it in-place, but have to create a new object.
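
    As a small sketch of that point: namedtuples provide a `_replace` method that returns a new object with selected fields changed, which can be shorter than building a `ParseResult` by hand. The example URL is arbitrary, and Python 3 is assumed.

```python
from urllib.parse import urlparse

p = urlparse('http://google.com/search?q=x')

# Direct assignment fails because namedtuple fields are read-only.
try:
    p.netloc = 'www.google.com'
except AttributeError:
    print('ParseResult cannot be edited in place')

# _replace returns a new ParseResult and leaves the original untouched.
with_www = p._replace(netloc='www.' + p.netloc)

print(p.geturl())         # http://google.com/search?q=x
print(with_www.geturl())  # http://www.google.com/search?q=x
```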

    PS: On Python 3, the function is urllib.parse.urlparse (the Python 2 urlparse module was renamed to urllib.parse).

  • 2021-01-08 00:43

    For the formats that you mention in your question, you can do something as simple as:

    def convert(url):
        if url.startswith('http://www.'):
            return 'http://' + url[len('http://www.'):]
        if url.startswith('www.'):
            return 'http://' + url[len('www.'):]
        if not url.startswith('http://'):
            return 'http://' + url
        return url
    

    But please note that there are probably other formats that you are not anticipating. Also keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (that is, DNS may not be able to resolve its host to an IP address).
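
    To make the caveat about unanticipated formats concrete, here is a quick check that repeats `convert` from above and runs it over the four formats from the question plus one extra input. The `https://` input is an assumption added for illustration; it is not one of the formats in the question.

```python
def convert(url):
    # Same logic as the function above, repeated so this snippet runs on its own.
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url


# The four formats from the question all normalise to the same URL.
for url in ('google.com', 'www.google.com', 'http://google.com', 'http://www.google.com'):
    print(convert(url))  # http://google.com

# An input outside those formats is mangled rather than normalised.
print(convert('https://www.google.com'))  # http://https://www.google.com
```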
