Question
I parsed an HTML document that contains Russian text. When I try to print it in Python, I get this:
ÐлÑбниÑнÑй новогодний пÑнÑ
I tried to decode it and got ISO-8859-1 as the encoding. I'm trying to decode it like this:
print drink_name.decode('iso8859-1')
But I get an error. How can I print this text, or encode it in Unicode?
Answer 1:
You have a Mojibake: UTF-8 bytes that were decoded as Latin-1 or CP1252 in this case.
You can repair it by reversing the process:
>>> print u'ÐлÑбниÑнÑй новогодний пÑнÑ'.encode('latin1').decode('utf8')
Клубничный новогодний пунш
(I had to copy the string from the original post source to capture all the non-printable bytes in the Mojibake).
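The round trip can be checked without copying any non-printable bytes by producing the Mojibake from the known-good text first. This is a minimal sketch in Python 3 syntax (the snippets in this thread are Python 2); the variable names are made up for illustration:

```python
# Python 3 sketch: reproduce the Mojibake from the correct text, then undo it.
original = "Клубничный новогодний пунш"

# The wrong step: UTF-8 bytes get decoded as Latin-1.
# Latin-1 maps every byte to a code point, so this never raises an error.
garbled = original.encode("utf-8").decode("latin-1")

# The repair: reverse the wrong decode, then decode with the right codec.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == original)
```

Each Cyrillic letter takes two bytes in UTF-8, so the garbled string is longer than the original and some of its characters are invisible control characters.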
The better approach is to not decode incorrectly in the first place. You decoded the original text with the wrong encoding; use UTF-8 as the codec instead.
If you used requests to download the page, do not use response.text in this case; if the server failed to specify a codec, then the HTTP RFC default is to use Latin-1, but HTML documents often embed the encoding in a <meta> header instead. Leave decoding in such cases to your parser, like BeautifulSoup:
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content) # pass in undecoded bytes
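To see why the undecoded bytes matter, here is a rough Python 3 sketch of the kind of <meta> lookup a parser does for you. The helper sniff_meta_charset and the sample page are hypothetical, and a real parser handles many more cases (byte order marks, conflicting declarations, detection heuristics), so treat this as an illustration only:

```python
import re


def sniff_meta_charset(raw, default="utf-8"):
    """Hypothetical helper: return the charset named in a <meta> tag, if any."""
    # The declaration itself is ASCII, so it can be searched for in raw bytes.
    head = raw[:1024]
    match = re.search(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


# Stand-in for response.content: UTF-8 bytes with the encoding declared inline.
page = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>Клубничный новогодний пунш</body></html>"
).encode("utf-8")

wrong = page.decode("latin-1")                 # the Latin-1 fallback
right = page.decode(sniff_meta_charset(page))  # honour the <meta> declaration

print("Клубничный" in wrong, "Клубничный" in right)
```

The Latin-1 decode never fails, which is why the problem only shows up later as garbled output rather than as an exception at download time.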
Source: https://stackoverflow.com/questions/26869933/russian-symbols-in-python-output-corrupted-encoding