Question

I'm writing a script that visits a list of links and parses the information on each page.

It works for most sites, but it chokes on some with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)".

It stops in http/client.py, which urllib calls into, on Python 3.

The exact link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar posts here, but none of the answers seem to work for me.

My code is:

import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    try:
        # Long timeout because many requests were timing out.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print('The server couldn\'t fulfill the request for ' + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
            return ''
    else:
        return unicode_html

This calls the request function:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)

And the traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

Any help is appreciated; it's driving me crazy. I think I've tried every combination of x.decode and similar.

(I could ignore the offending characters, if that is possible.)

Answer 1:

Use a percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the above percent-encoded URL by pointing the browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

going to the page, then copying and pasting the encoded URL supplied by the browser back into the text editor. However, you can generate a percent-encoded URL programmatically using:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
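The steps above can be wrapped in a small helper. This is only a minimal sketch: the function name `percent_encode_url` is hypothetical (it is not part of urllib), and passing `%` as a safe character is an assumption made here so that a URL that is already percent-encoded is not encoded a second time. It also quotes the query string, which the snippet above leaves untouched.

```python
from urllib import parse

def percent_encode_url(link):
    """Percent-encode non-ASCII characters in the path and query of link.

    Hypothetical helper for illustration; not part of urllib.
    """
    scheme, netloc, path, query, fragment = parse.urlsplit(link)
    # Keep '/' and '%' so path separators and existing escapes survive unchanged.
    path = parse.quote(path, safe="/%")
    # Keep the characters that give a query string its structure.
    query = parse.quote(query, safe="=&%+")
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
encoded = percent_encode_url(link)
print(encoded)
```

Because `%` is treated as safe, a literal percent sign that is not part of an escape sequence is left as-is, and a non-ASCII host name is not handled at all; both would need extra care in real code.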

Answer 2:

Your URL contains characters that cannot be represented as ASCII characters.
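The failure can be reproduced without any network access. The traceback shows http.client calling `request.encode('ascii')` on the request line; the sketch below assumes a shortened request line of the same shape (the exact string is made up for illustration) and encodes it the same way.

```python
# A request line shaped like the one http.client builds: method, selector, version.
# The shortened path is an assumption for illustration.
request_line = 'GET /news/cafés-growing-faster HTTP/1.1'

error = None
try:
    request_line.encode('ascii')
except UnicodeEncodeError as e:
    error = e
    # Report which character failed and where it sits in the string.
    print(repr(e.object[e.start]), 'at position', e.start)
```

The 'é' sits at index 13 of the request line, which matches the position reported in the traceback.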

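You'll have to ensure that all characters have been properly URL-encoded; use urllib.parse.quote for the path or urllib.parse.quote_plus for query values, for example. Both escape non-ASCII characters as percent-encoded UTF-8 bytes. Apply them to individual URL components rather than to the whole URL, because they would also escape the ':' after the scheme.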
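To illustrate how the quoting functions in urllib.parse behave, here is a small sketch. The sample strings and the `example.com` host are assumptions made for illustration and do not come from the question.

```python
from urllib import parse

segment = 'cafés growing/faster'

# quote() keeps '/' by default and encodes a space as %20.
quoted = parse.quote(segment)
# quote_plus() also escapes '/' and encodes a space as '+', which suits query values.
quoted_plus = parse.quote_plus(segment)
# Quoting a whole URL escapes the ':' after the scheme, so the result is no longer usable.
whole_url = parse.quote('http://example.com/cafés')

print(quoted)
print(quoted_plus)
print(whole_url)
```

This is why the path is usually split out with urlsplit first and quoted on its own.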

Source: https://stackoverflow.com/questions/22734464/unicodeencodeerror-ascii-codec-cant-encode-character-xe9-when-using-ur

Tags

python