Convert GBK to utf8 string in python

面向向阳花 2021-01-07 04:16

I have a string.

s = u\"

        
5 Answers
  • 2021-01-07 04:28

    In Python 2, try this to convert your unicode string:

    >>> s.encode('latin-1').decode('gbk')
    u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
    

    Then you can encode the result to UTF-8 as you wish.

    >>> s.encode('latin-1').decode('gbk').encode('utf-8')
    "<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
    
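    For readers on Python 3, the same round trip can be sketched as follows. This assumes the text arrived as a str whose code points are really GBK byte values, as in the question. The variable names raw, text and utf8 are only illustrative.

```python
# Python 3 sketch. The str below holds GBK byte values as code points (mojibake).
s = "alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"

raw = s.encode("latin-1")    # code points 0-255 map back to the original bytes
text = raw.decode("gbk")     # interpret those bytes as GBK
utf8 = text.encode("utf-8")  # bytes object holding the UTF-8 encoding

print(text)  # alert('请输入正确验证码,谢谢!');location='index.asp';
```

    In Python 3 the final encode step is usually only needed when writing bytes to a file or socket, since str is already Unicode.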
  • 2021-01-07 04:35

    I had the same problem, with a string like this:

    name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'

    I wanted to convert it to

    u'\u53e4\u5251\u5947\u8c2d'

    Here is my solution:

    new_name = name.encode('iso-8859-1').decode('gbk')

    I also tried it with your string:

    s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"

    print s

    alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,Ð»Ð»!');location='index.asp';

    Then:

    _s = s.encode('iso-8859-1').decode('gbk')

    print _s

    alert('请输入正确验证码,谢谢!');location='index.asp';

    Hope this helps.
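
    If the same fix is needed in several places, it could be wrapped in a small function. The sketch below assumes Python 3; fix_mojibake and its parameters are hypothetical names, not part of any library.

```python
def fix_mojibake(text, wrong="latin-1", right="gbk"):
    """Hypothetical helper: undo a decode that used the wrong codec.

    Returns the input unchanged when the round trip is not possible,
    for example when the text was decoded correctly in the first place.
    """
    try:
        return text.encode(wrong).decode(right)
    except UnicodeError:  # parent of UnicodeEncodeError and UnicodeDecodeError
        return text


name = "\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7"
print(fix_mojibake(name))       # the repaired title shown above
print(fix_mojibake("古剑奇谭"))  # already correct, so returned unchanged
```

    Text that genuinely consists of accented Latin-1 characters could be misread as GBK, so this is best applied only where the mis-decoding is known to have happened.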

  • 2021-01-07 04:41

    If you can keep the alert in a separate string "a":

    a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
    s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
    print s
    

    Then it will print:

    <script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
    

    If you want to decode the whole string in one go, including the surrounding markup:

    s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
    s = s.decode("gbk")  # ASCII bytes decode to themselves, so the markup is unchanged
    print s
    

    will print:

    <script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
    
  • 2021-01-07 04:46

    You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.

    This is the correct way to do it in Python 2.

    g = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,' \
        '\xd0\xbb\xd0\xbb!'.decode('gbk')
    s = (u"<script language=javascript>alert('" + g +
         u"');location='index.asp';</script></script>")
    

    Notice how the initializer for g which is passed to .decode('gbk') is not represented as a Unicode string, but as a plain byte string.

    See also http://nedbatchelder.com/text/unipain.html
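
    In Python 3 this distinction is enforced by the types themselves: byte strings are written b'...', str has no .decode() and bytes has no .encode(). A minimal sketch, assuming Python 3; gbk_to_utf8 is a hypothetical helper name and not something from this answer.

```python
def gbk_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: transcode GBK-encoded bytes to UTF-8 bytes."""
    return data.decode("gbk").encode("utf-8")


# The GBK payload is a bytes literal, not a str.
g = b"\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!"

message = g.decode("gbk")  # bytes -> str
page = ("<script language=javascript>alert('" + message +
        "');location='index.asp';</script>")
utf8_page = page.encode("utf-8")  # str -> bytes, ready to write out

print(message)
```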

  • 2021-01-07 04:53

    Take a look at the unicodedata module. I think one way to do this is:

    import unicodedata
    
    s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
    unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
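
    A quick check, assuming Python 3, suggests that normalization on its own does not undo the mis-decoding: NFKD only splits accented characters into base letters and combining marks, so the Latin-1 to GBK round trip from the other answers still seems to be needed. The variable names here are illustrative.

```python
import unicodedata

# First three characters of the message, mis-decoded as Latin-1.
s = "\xc7\xeb\xca\xe4\xc8\xeb"

normalized = unicodedata.normalize("NFKD", s)  # base letters plus combining marks
repaired = s.encode("latin-1").decode("gbk")   # recover the GBK bytes, then decode

print(ascii(normalized))
print(repaired)
```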
    