How to deal with ® in a URL for urllib2.urlopen?

花落未央 2020-12-04 00:45

I received this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions

url=u'http


        
1 Answer
  • 2020-12-04 01:28

    A URL must be a valid bytestring, with any non-ASCII codepoints percent-encoded. Encode the URL to UTF-8 first, then URL-quote its path:

    import urllib
    import urllib2
    import urlparse
    
    originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
    parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
    parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
    encoded_link = parsed_link.geturl()
    source = urllib2.urlopen(encoded_link).read()
    

    Demo:

    >>> import urllib
    >>> import urllib2 
    >>> import urlparse
    >>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
    >>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
    >>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
    >>> encoded_link = parsed_link.geturl()
    >>> encoded_link
    'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
    >>> source = urllib2.urlopen(encoded_link).read()
    >>> len(source)
    68758
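
The answer above targets Python 2. On Python 3, where `urllib2` and `urlparse` were folded into `urllib.request` and `urllib.parse`, the same two steps might look roughly like the sketch below. The helper name `quote_url_path` is made up for illustration, and the sketch assumes the path has not already been percent-encoded (an existing `%` would be escaped a second time).

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode the path component of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves '/' unescaped.
    return parts._replace(path=quote(parts.path)).geturl()


original_url = (
    'https://www.packtpub.com/virtualization-and-cloud/'
    'citrix-xenapp\xae-75-desktop-virtualization-solutions'
)
encoded_url = quote_url_path(original_url)
print(encoded_url)

# To fetch the page (not executed here, since it needs network access):
# from urllib.request import urlopen
# source = urlopen(encoded_url).read()
```

Only the path is quoted here, as in the original answer; the scheme and host are left untouched.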
    