Python 3: `netloc` value in `urllib.parse` is empty if URL doesn't have `//`

盖世英雄少女心 2021-01-14 02:54

I notice that `netloc` is empty if the URL doesn't have `//`.

Without //, netloc is empty
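
A minimal illustration of the behaviour described above, using `www.example.com` as a placeholder host (an assumption, not a URL from the original post):

```python
from urllib.parse import urlparse

# With a leading "//", the host is recognised as the netloc.
with_slashes = urlparse("//www.example.com/path")

# Without "//", the whole string is treated as a relative path.
without_slashes = urlparse("www.example.com/path")

print(repr(with_slashes.netloc))     # 'www.example.com'
print(repr(without_slashes.netloc))  # ''
print(repr(without_slashes.path))    # 'www.example.com/path'
```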



        
2 Answers
  • 2021-01-14 03:47

    I'm working on an application that needs to parse out the scheme and netloc from a URL that might not have any scheme set. I've settled on this approach, although it is smelly and I doubt it will handle every corner case either.

    Python 3.8.0 (default, Dec  3 2019, 17:33:19)
    [GCC 9.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.parse
    >>> url="google.com"
    >>> o = urllib.parse.urlsplit(url)
    >>> u = urllib.parse.SplitResult(
    ...     scheme=o.scheme if o.scheme else "https",
    ...     netloc=o.netloc if o.netloc else o.path,
    ...     path="",
    ...     query="",
    ...     fragment=""
    ... )
    >>> urllib.parse.urlunsplit(u)
    'https://google.com'
    >>>
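
The session above could be wrapped in a reusable function roughly as follows. This is only a sketch and a variation on the answer: the name `with_default_scheme` is hypothetical, and unlike the session it keeps the path, query and fragment instead of discarding them, by treating the first path segment as the host. Input containing a port without a scheme (for example `host:8080`) may be read by `urlsplit` as having a scheme and is not handled here.

```python
from urllib.parse import SplitResult, urlsplit, urlunsplit


def with_default_scheme(url, default_scheme="https"):
    """Return url with a scheme and netloc filled in (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.netloc:
        netloc, path = parts.netloc, parts.path
    else:
        # No "//" in the input: assume the first path segment is the host.
        netloc, sep, rest = parts.path.partition("/")
        path = sep + rest
    return urlunsplit(SplitResult(
        scheme=parts.scheme or default_scheme,
        netloc=netloc,
        path=path,
        query=parts.query,
        fragment=parts.fragment,
    ))


print(with_default_scheme("google.com"))             # https://google.com
print(with_default_scheme("google.com/search?q=x"))  # https://google.com/search?q=x
print(with_default_scheme("http://google.com/a"))    # http://google.com/a
```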
    
  • 2021-01-14 03:56

    Would it be possible to identify netloc correctly even if // not provided in the URL?

    Not by using urlparse. This is explicitly explained in the documentation:

    Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by //. Otherwise the input is presumed to be a relative URL and thus to start with a path component.


    If you don't want to rewrite urlparse's logic (which I would not suggest), make sure url starts with //:

    if not url.startswith('//'):
        url = '//' + url
    

    EDIT

    The above is actually a bad solution, as @alexis noted: it also prepends // to URLs that already carry a scheme, such as http://google.com. Perhaps:

    if not url.startswith(('//', 'http://', 'https://')):
        url = '//' + url
    

    But your mileage may vary with that solution as well. If you have to support a wide variety of inconsistent formats, you may have to resort to regex.
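
One way the regex idea might look, as a sketch rather than a complete solution: instead of listing `http://` and `https://` explicitly, match any scheme (letters, digits, `+`, `.`, `-`, starting with a letter) followed by `//`, and prepend `//` only when that is absent. The names `_HAS_AUTHORITY` and `parse_with_netloc` are hypothetical. URLs that have a scheme but no `//`, such as `mailto:` addresses, are not handled correctly by this check.

```python
import re
from urllib.parse import urlparse

# An optional scheme followed by "//", e.g. "ftp://", "https://" or bare "//".
_HAS_AUTHORITY = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//")


def parse_with_netloc(url):
    """Parse url, prepending "//" when no authority marker is present."""
    if not _HAS_AUTHORITY.match(url):
        url = "//" + url
    return urlparse(url)


# Prepending "//" unconditionally breaks URLs that already have a scheme:
print(repr(urlparse("//" + "http://google.com").netloc))  # 'http:'

# The regex check leaves such URLs alone and fixes the scheme-less ones:
print(parse_with_netloc("http://google.com").netloc)       # google.com
print(parse_with_netloc("google.com/search").netloc)       # google.com
print(parse_with_netloc("ftp://example.org/file").netloc)  # example.org
```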
