Question
I got this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions
url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
I want to pass it back into urllib2.urlopen:
import urllib2
source = urllib2.urlopen(url).read()
The error I get:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence
Thus, I tried:
source = urllib2.urlopen(url.encode("utf-8")).read()
This returns page source, but it differs from what the original URL returns:
originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source
The result is False. How can I fix this URL? How do I convert u'\xae' back into the original ®?
Answer 1:
A URL must be a valid bytestring, with any non-ASCII code points encoded correctly. You'll need to encode the URL to UTF-8, then URL-quote its path:
import urllib
import urllib2
import urlparse
originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()
Demo:
>>> import urllib
>>> import urllib2
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758
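The answer above targets Python 2. As a rough sketch only, assuming Python 3 (where urlparse and urllib.quote live in urllib.parse), the same quoting step might look like the following. The helper name quote_url_path is hypothetical and not part of the original answer, and no network request is made here.

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode non-ASCII characters in the path of `url`.

    Hypothetical helper for illustration. In Python 3, quote() encodes
    str input as UTF-8 by default and leaves '/' unescaped.
    """
    parts = urlsplit(url)
    # Only the path is quoted; scheme, host, query and fragment are untouched.
    parts = parts._replace(path=quote(parts.path))
    return parts.geturl()


original_url = ('https://www.packtpub.com/virtualization-and-cloud/'
                'citrix-xenapp\xae-75-desktop-virtualization-solutions')
encoded_link = quote_url_path(original_url)
# encoded_link ends with 'citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
```

The result could then be passed to urllib.request.urlopen. Note that a path which is already percent-encoded would be quoted a second time ('%' becomes '%25'), so this sketch assumes the input path is still unescaped.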
Source: https://stackoverflow.com/questions/26615374/how-to-deal-with-in-url-for-urllib2-urlopen