Question
I make a scrape/curl request to fetch HTML from another site whose content is in Chinese, but some of the resulting text is garbled. It looks like this:
°¢Àï°Í°ÍΪÄúÌṩÁË×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅƵç×Ó±í ÖÇÄÜʱÉг±Á÷ŮʿÊÖ»·ÊÖÁ´Ê×Êαí´øµÈ²úÆ·£¬ÕâÀïÔƼ¯ÁËÖÚ¶àµÄ¹©Ó¦ÉÌ£¬²É¹ºÉÌ£¬ÖÆÔìÉÌ¡£ÓûÁ˽â¸ü¶à×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅƵç×Ó±í ÖÇÄÜʱÉг±Á÷ŮʿÊÖ»·ÊÖÁ´Ê×Êαí´øÐÅÏ¢£¬Çë·ÃÎÊ°¢Àï°Í°ÍÅú·¢Íø£¡
It should be Chinese text. This is my code:
str(result.decode('ISO-8859-1'))
Without the 'ISO-8859-1' decode (returning the result variable as-is), it displays question marks like this:
����Ͱ�Ϊ���ṩ�������߹���ֱ��Ʒ�Ƶ��ӱ� ����ʱ�г���Ůʿ�ֻ��������α����Ȳ�Ʒ�������Ƽ����ڶ�Ĺ�Ӧ�̣��ɹ��̣������̡����˽���������߹���ֱ��Ʒ�Ƶ��ӱ� ����ʱ�г���Ůʿ�ֻ��������α�����Ϣ������ʰ���Ͱ���������
Could you tell me which encoding I should use to decode it?
Thanks
Answer 1:
Chinese text can be stored in several possible charsets.
Three common Chinese charsets are GB2312, Big5, and GBK.
Here is a snippet that converts a file from GB2312 to UTF-8.
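import codecs
# Read every line of the GB2312-encoded input file as text.
infile = codecs.open("in.txt", "r", "gb2312")
lines = infile.readlines()
infile.close()
print(lines)
# Write the same lines back out, encoded as UTF-8.
outfile = codecs.open("out.txt", "w", "utf-8")
outfile.writelines(lines)
outfile.close()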
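On Python 3 the built-in open accepts an encoding argument, so the same conversion can be sketched without codecs. The helper name convert_to_utf8 and its parameters are illustrative assumptions, not part of the original answer.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gb2312"):
    """Re-encode a text file to UTF-8 (hypothetical helper for illustration)."""
    # Decode the source file with the charset it was actually written in.
    with open(src_path, "r", encoding=src_encoding) as infile:
        text = infile.read()
    # Write the decoded text back out as UTF-8.
    with open(dst_path, "w", encoding="utf-8") as outfile:
        outfile.write(text)
    return text
```

If the source charset is wrong, the read step will typically raise UnicodeDecodeError or produce garbled text, so it is worth confirming the charset first.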
Answer 2:
The solution was really simple. As mentioned by @Thu Yein tun, I checked the Content-Type header of the HTTP response, which showed text/html;charset=GBK, so I changed my code to:
result.decode('gbk')
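The header check described above can be sketched as follows. This is a minimal illustration, assuming the raw response body and the Content-Type header value are already available; the helper names charset_from_content_type and decode_body are hypothetical, and the sample bytes stand in for the real page.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lowercase, or `default` if none is declared.
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset the server declared."""
    return body.decode(charset_from_content_type(content_type))


# Stand-in for the response body: the first word of the page, encoded as GBK.
body = "阿里巴巴".encode("gbk")

text = decode_body(body, "text/html;charset=GBK")

# Decoding the same bytes as ISO-8859-1 gives the garbled form from the question.
garbled = body.decode("iso-8859-1")

# Text that was already mis-decoded can be repaired by reversing that step.
repaired = garbled.encode("iso-8859-1").decode("gbk")
```

The repair works because ISO-8859-1 maps every byte to exactly one character, so re-encoding restores the original bytes without loss.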
Answer 3:
Try this block of code (Python 2).
Import unquote, encode the content as latin1 to recover the raw bytes, unquote those bytes, and then decode the result as UTF-8.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import unquote
# The soft hyphen (U+00AD) is written as \xad so that it stays visible.
bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
unquoted = unquote(bytesquoted)
print unquoted.decode('utf8')
Output:
台南 親子餐廳
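The snippet above targets Python 2, where urllib2 exposes unquote. On Python 3 a rough equivalent, assuming the same partly percent-encoded input, could use urllib.parse.unquote_to_bytes. The variable names are illustrative, and the soft hyphen is written as \xad so it stays visible.

```python
from urllib.parse import unquote_to_bytes

# UTF-8 bytes that were read as latin1, with some bytes left percent-encoded.
quoted_text = "å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³"

# Encoding as latin1 turns each character back into its original byte.
quoted_bytes = quoted_text.encode("latin1")

# Replace the %XX escapes with the bytes they stand for.
raw_bytes = unquote_to_bytes(quoted_bytes)

# The recovered byte string is valid UTF-8.
decoded = raw_bytes.decode("utf8")
print(decoded)
```

Passing bytes rather than str matters here: given a str, unquote_to_bytes would first encode it as UTF-8, which would not reproduce the original bytes.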
Source: https://stackoverflow.com/questions/53954604/python-encoding-chinese-to-special-character