Question
I have a text which contains characters such as "\xaf" and "\xbe", which, as I understand it from this question, are ASCII-encoded characters.
I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library?
Sample 200 characters here.
Answer 1:
Your file is already a UTF-8 encoded file.
# saved encoding-sample to /tmp/encoding-sample
import sys
import codecs
import unicodedata as ud

fp= codecs.open("/tmp/encoding-sample", "r", "utf8")
data= fp.read()

chars= sorted(set(data))
for char in chars:
    try:
        charname= ud.name(char)
    except ValueError:
        charname= "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))
And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
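For comparison, a minimal Python 3 sketch of the same character inspection (assuming the sample is still saved at the /tmp/encoding-sample path used above); codecs.open is no longer needed, since open() takes an encoding argument directly:

# Python 3 sketch: list every distinct character in the file with its Unicode name.
import unicodedata as ud

with open("/tmp/encoding-sample", "r", encoding="utf-8") as fp:
    data = fp.read()

for char in sorted(set(data)):
    # name() accepts a default value, so no try/except is needed here
    print("char U%04x %s" % (ord(char), ud.name(char, "<unknown>")))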
Answer 2:
.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).
In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).
The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.
>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
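In Python 3, where there is no implicit decode and .encode exists only on str, a rough equivalent of the same conversion (assuming the input really is ISO-8859-1 bytes) looks like this:

# Decode the ISO-8859-1 bytes to str, then encode the result as UTF-8.
raw = b'\xAF \xBE'
utf8_bytes = raw.decode('ISO-8859-1').encode('UTF-8')
print(utf8_bytes)  # b'\xc2\xaf \xc2\xbe'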
Answer 3:
It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode from it, and then re-encode in UTF-8.
Could you provide an actual string sample? Then we can probably guess the current encoding.
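One way to make that guess programmatically (not part of this answer; a sketch assuming the third-party chardet package is installed and the sample lives at a hypothetical path) is to run a detector over the raw bytes before decoding:

# Guess the source encoding with chardet, then decode and re-encode as UTF-8.
import chardet

with open("/tmp/encoding-sample", "rb") as fp:  # hypothetical sample path
    raw = fp.read()

guess = chardet.detect(raw)                         # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess["encoding"] or "latin-1")   # fall back if detection returns None
utf8_bytes = text.encode("utf-8")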
Source: https://stackoverflow.com/questions/4736261/how-to-convert-xxy-encoded-characters-to-utf-8-in-python