Python: open a URL with accent

Question

In Python 2.7, I want to open a URL which contains accents (the link itself, not the page to which it's pointing). If I use the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2


test = "https://www.notifymydevice.com/push?ApiKey=K6HGFJJCCQE04G29OHSRBIXI&PushTitle=Les%20accents%20:%20éèçà&PushText=Messages%20éèçà&"

urllib2.urlopen(test)

My accents are converted to gibberish (Ã, ¨, ©, etc., rather than the éèà I expect).

After searching for similar issues, I tried urllib2.urlopen(test.encode('utf-8')), but Python throws an error in that case:

  File "test.py", line 10, in <module>
    urllib2.urlopen(test.encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 98: ordinal not in range(128)

Answer 1:

Prefix the string with u. I get no errors when I try this in the REPL:

import urllib
test = u'https://www.notifymydevice.com/push?ApiKey=K6HGFJJCCQE04G29OHSRBIXI&PushTitle=Les%20accents%20:%20éèçà&PushText=Messages%20éèçà&'
urllib.urlopen(test.encode("UTF-8"))

The u prefix marks the literal as a unicode string, so calling encode on it does not trigger an implicit ASCII decode first.

Answer 2:

If you call encode on a str, Python has to first decode it to unicode so it can encode that Unicode to UTF-8. And to decode it, it has to guess what encoding you used, because you didn't tell it. So it guesses 'ascii' (actually, it guesses whatever sys.getdefaultencoding() says, but that's usually 'ascii'), which fails.
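The two-step behaviour described above can be reproduced explicitly. The following is only a sketch: implicit_encode is a hypothetical helper written for illustration, not a Python API, and it spells out the decode-then-encode sequence so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: implicit_encode is a hypothetical helper, not part of Python.


def implicit_encode(byte_string, target="utf-8", default="ascii"):
    """Mimic Python 2's str.encode(): decode with a default codec, then encode."""
    return byte_string.decode(default).encode(target)


# The two bytes 0xc3 0xa9, which are the UTF-8 encoding of e-acute.
raw = u"\u00e9".encode("utf-8")

try:
    implicit_encode(raw)
    error_message = None
except UnicodeDecodeError as exc:
    # The ASCII decode step fails before any encoding happens.
    error_message = str(exc)

# Naming the real encoding for the decode step makes the round trip succeed.
round_trip = implicit_encode(raw, default="utf-8")
```

This is why the traceback in the question reports a decode error even though the code only asked to encode.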

At any rate, there are two possible problems here, which have different solutions. So, you need to figure out which one you have, before trying to fix it.

Try printing out the individual bytes of the string—or, more simply, the repr:

print repr(test)

If the é shows up as \xc3\xa9, it's UTF-8.
If it shows up as \xe9, it's Latin-1 (or cp1252 or something else Latin-1-compatible).
If it shows up as something else, it's a different character set, and you'll have to work out which one.
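The check above can also be scripted. This is a rough sketch, assuming the string is known to contain an é; guess_encoding is a hypothetical helper, and looking for a single character is a heuristic rather than real charset detection.

```python
# -*- coding: utf-8 -*-
# Sketch only: guess_encoding is a hypothetical helper for illustration.

E_ACUTE = u"\u00e9"


def guess_encoding(raw_bytes):
    """Guess the encoding of a byte string known to contain an e-acute."""
    if E_ACUTE.encode("utf-8") in raw_bytes:      # bytes \xc3\xa9
        return "utf-8"
    if E_ACUTE.encode("latin-1") in raw_bytes:    # byte \xe9
        return "latin-1 compatible"
    return "unknown"


utf8_sample = b"Les accents : \xc3\xa9"
latin1_sample = b"Les accents : \xe9"

utf8_guess = guess_encoding(utf8_sample)
latin1_guess = guess_encoding(latin1_sample)
```

The UTF-8 test runs first because it is the more specific pattern of the two.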

If you're giving Python Latin-1 source and telling it it's UTF-8, it won't complain, but it means you'll be sending Latin-1 bytes where you think you're sending UTF-8 characters, and you'll get mojibake all over the place.

The fix is to save the source code as UTF-8 in your text editor.
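To confirm that a saved file really is UTF-8, one option is to read its raw bytes and attempt a strict decode. This is a minimal sketch; is_utf8 and file_is_utf8 are hypothetical helper names, and the path passed to file_is_utf8 is whatever script you want to check.

```python
# -*- coding: utf-8 -*-
# Sketch only: is_utf8 and file_is_utf8 are hypothetical helpers.
import io


def is_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path):
    """Read a file as raw bytes and test whether it is valid UTF-8."""
    with io.open(path, "rb") as handle:
        return is_utf8(handle.read())


utf8_ok = is_utf8(b"\xc3\xa9\xc3\xa8")    # UTF-8 bytes for two accented letters
latin1_ok = is_utf8(b"\xe9\xe8\xe7\xe0")  # Latin-1 bytes, not valid UTF-8
```

A file that passes is not guaranteed to be what the author intended, but Latin-1 text containing accented letters will almost always fail this test.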

If it already is UTF-8, then the problem is that the server isn't expecting the URL to be UTF-8.

The URL standards don't mandate any particular meaning for (%-encoded) non-ASCII bytes; any server can do anything it wants with them. And if you're talking to a server that treats such bytes as, say, cp1252, but you're sending it UTF-8, you're going to get mojibake.

The fix for this is to reconfigure the server to handle UTF-8 if you control the server, or to send strings in the character set the server wants if you don't.
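On the client side, sending strings in the character set the server wants usually means percent-encoding the parameter values yourself. The sketch below assumes the server accepts ordinary percent-encoded query strings; build_push_url is a hypothetical helper, YOUR_API_KEY is a placeholder, and cp1252 is only an example of a non-UTF-8 charset a server might expect.

```python
# -*- coding: utf-8 -*-
# Sketch only: build_push_url is a hypothetical helper for illustration.
try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def build_push_url(base, params, charset="utf-8"):
    """Percent-encode each unicode value as bytes in the given charset."""
    pairs = []
    for name, value in params:
        # Encode to bytes first, then escape every byte outside the unreserved set.
        encoded = quote(value.encode(charset), safe="")
        pairs.append(name + "=" + encoded)
    return base + "?" + "&".join(pairs)


params = [
    ("ApiKey", u"YOUR_API_KEY"),  # placeholder, not a real key
    ("PushTitle", u"Les accents : \u00e9\u00e8\u00e7\u00e0"),
]

base = "https://www.notifymydevice.com/push"
utf8_url = build_push_url(base, params)
cp1252_url = build_push_url(base, params, charset="cp1252")
```

The resulting URL is pure ASCII, so it can be passed to urllib2.urlopen without any further encode call. Which charset to pass depends entirely on what the server decodes with.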

Source: https://stackoverflow.com/questions/51715290/python-open-a-url-with-accent

Tags

python

utf-8

urllib2