I notice that netloc is empty if the URL doesn't have //.
Without //, netloc is empty.
I'm working on an application that needs to parse out the scheme and netloc from a URL that might not have any scheme set. I've settled on this approach, although it is smelly and I doubt it will handle every corner case either.
Python 3.8.0 (default, Dec 3 2019, 17:33:19)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> url="google.com"
>>> o = urllib.parse.urlsplit(url)
>>> u = urllib.parse.SplitResult(
...     scheme=o.scheme if o.scheme else "https",
...     netloc=o.netloc if o.netloc else o.path,
...     path="",
...     query="",
...     fragment=""
... )
>>> urllib.parse.urlunsplit(u)
'https://google.com'
>>>
Would it be possible to identify netloc correctly even if // is not provided in the URL?
Not by using urlparse. This is explicitly explained in the documentation:
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
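To make the documented behavior concrete, here is a small sketch that splits the same host with and without the leading //. The helper name describe is hypothetical and only there for illustration; urlsplit itself is the standard library function.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (scheme, netloc, path) as urlsplit sees them.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# Without '//' the host ends up in path; with it, the host is the netloc.
for url in ("google.com", "//google.com", "https://google.com"):
    print(url, describe(url))
```

The bare "google.com" comes back with an empty netloc and the host in path, which is exactly what the question observed.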
If you don't want to rewrite urlparse's logic (which I would not suggest), make sure url starts with //:
if not url.startswith('//'):
    url = '//' + url
EDIT
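The above is actually a bad solution, as @alexis noted: it also prefixes URLs that already carry a scheme, turning http://google.com into //http://google.com. Perhaps: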
if not url.startswith(('//', 'http://', 'https://')):
    url = '//' + url
But your mileage may vary with that solution as well. If you have to support a wide variety of inconsistent formats, you may have to resort to regex.
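For what it's worth, here is a rough sketch of what the regex route could look like. The function name split_with_default, the pattern name _HAS_AUTHORITY, and the "https" default are all my own assumptions, not anything from urllib. The idea is to treat the input as already having an authority only if it starts with // or with scheme://, and to prepend // otherwise.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a leading '//' or 'scheme://' (hypothetical pattern, not from urllib).
_HAS_AUTHORITY = re.compile(r"^(?:[A-Za-z][A-Za-z0-9+.-]*:)?//")


def split_with_default(url, default_scheme="https"):
    """Split url, treating a bare 'host/path' as a network location.

    Assumption: every input is meant to be an absolute web URL, so
    scheme-only forms such as 'mailto:user@example.com' will be misparsed.
    """
    if not _HAS_AUTHORITY.match(url):
        url = "//" + url
    parts = urlsplit(url)
    # Fill in the scheme only when the input did not provide one.
    return parts._replace(scheme=parts.scheme or default_scheme)


print(urlunsplit(split_with_default("google.com")))
print(urlunsplit(split_with_default("http://google.com/a?b=1")))
```

Unlike the hard-coded http:// and https:// check, this should also leave other scheme:// URLs alone, and it keeps the path and query instead of discarding them. It is still a heuristic, so test it against the inputs you actually expect.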