问题
I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.
One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:
testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl,
urlparse.urlparse(testUrl).path)
Instead of the expected result http://www.example.com//path
(or even better, with a normalized single slash), I end up with http://path
.
BTW the reason I'm running such code is because it's the only way I found so far to strip the query / fragment part off of URLs. Maybe there is a better way to do it, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
回答1:
If you only want to get the url without the query part, I would skip the urlparse module and just do:
testUrl.rsplit('?')
The url will be at index 0 of the list returned and the query at index 1.
It is not possible to have two '?' in an url so it should work for all urls.
回答2:
The path (//path
) alone is not valid, which confuses the function and gets interpreted as a hostname
http://tools.ietf.org/html/rfc3986.html#section-3.3
If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
I don't particularly like either of these solutions, but they work:
import re
import urlparse
testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)
print cleaned
# http://www.example.com/path?foo=bar
print urlparse.urljoin(
testurl,
urlparse.urlparse(cleaned).path)
# http://www.example.com//path
Depending on what you are doing, you could do the joining manually:
import re
import urlparse
testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))
newurl = ["" for i in range(6)] # could urlparse another address instead
# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
newurl[i] = parsed[i]
# Rest are blank
for i in range(4, 6):
newurl[i] = ''
print urlparse.urlunparse(newurl)
# http://www.example.com//path
回答3:
It is mentioned in official urlparse docs that:
If url is an absolute URL (that is, starting with // or scheme://), the url‘s host name and/or scheme will be present in the result. For example
urljoin('http://www.cwi.nl/%7Eguido/Python.html',
... '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.
So you can do :
urlparse.urljoin(testUrl,
urlparse.urlparse(testUrl).path.replace('//','/'))
Output = 'http://www.example.com/path'
回答4:
Can not that be a solution?
urlparse.urlparse(testUrl).path.replace('//', '/')
回答5:
Try This:
def http_normalize_slashes(url):
url = str(url)
segments = url.split('/')
correct_segments = []
for segment in segments:
if segment != '':
correct_segments.append(segment)
first_segment = str(correct_segments[0])
if first_segment.find('http') == -1:
correct_segments = ['http:'] + correct_segments
correct_segments[0] = correct_segments[0] + '/'
normalized_url = '/'.join(correct_segments)
return normalized_url
Example URLs:
print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))
Will return:
http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
Hope it helps.. :)
回答6:
This answer seemed to give the best results in the cases I tried for correcting double slashes in paths, without touching the initial double slash in the http:// bit.
here's the code:
from urlparse import urljoin
from functools import reduce
def slash_join(*args):
return reduce(urljoin, args).rstrip("/")
回答7:
i've adopted to my needs @yunhasnawa's answer. here is part:
import urllib2
from urlparse import urlparse, urlunparse
def sanitize_url(url):
url_parsed = urlparse(url)
return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))
def avoid_double_slash(path):
parts = path.split('/')
not_empties = [part for part in parts if part]
return '/'.join(not_empties)
>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'
来源:https://stackoverflow.com/questions/8925938/url-parsing-in-python-normalizing-double-slash-in-paths