Converting a latin string to unicode in python

清歌不尽 2021-01-03 08:17

I am working on Scrapy. I scraped some sites and stored the items from the scraped pages in JSON files, but some of the strings come out in the following format.



        
2 Answers
  • 2021-01-03 08:43

    You have byte strings containing Unicode escapes. In Python 2 you can convert them to unicode objects with the unicode_escape codec:

    >>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
    Hêtres et étang
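
The session above is Python 2, where `str` is a byte string and has a `.decode` method. On Python 3, a sketch of the same conversion might look like the following. It assumes the input is a text string that holds literal `\uXXXX` sequences and is otherwise pure ASCII; the helper name `unescape` is made up for this illustration. Note that `unicode_escape` also interprets other backslash escapes such as `\n`.

```python
def unescape(text):
    """Turn literal \\uXXXX sequences in an ASCII-only str into real characters."""
    # Python 3 str has no .decode(), so go through bytes first.
    # encode("ascii") raises UnicodeEncodeError if the text already holds
    # non-ASCII characters, instead of silently mangling them.
    return text.encode("ascii").decode("unicode_escape")


raw = "H\\u00eatres et \\u00e9tang"  # the backslash-u sequences are literal here
print(unescape(raw))  # Hêtres et étang
```
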
    

    And you can encode it back to byte strings:

    >>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
    >>> s.encode("latin1")
    'H\xeatres et \xe9tang'
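
One caveat, shown here as a small Python 3 sketch with illustrative sample strings: `latin1` can only represent code points up to U+00FF, so it works for this French text but not for arbitrary Unicode. UTF-8 can encode any text.

```python
s = "Hêtres et étang"

latin1_bytes = s.encode("latin1")  # one byte per character
utf8_bytes = s.encode("utf-8")     # ê and é take two bytes each

print(latin1_bytes)  # b'H\xeatres et \xe9tang'
print(len(latin1_bytes), len(utf8_bytes))  # 15 17

# Characters outside Latin-1, such as the euro sign, cannot be encoded.
try:
    "5 €".encode("latin1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin1_ok)  # False
```
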
    

    If a list l mixes unicode objects and escaped byte strings, you can pick out and decode the byte strings like this:

    for s in l: 
        if not isinstance(s, unicode): 
            print s.decode('unicode_escape')
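
In Python 3 every text value is already `str`, so the `isinstance(s, unicode)` check has no direct equivalent. One possible adaptation, sketched below with a hypothetical helper `decode_escaped` and made-up sample data, is to rewrite only the items that contain a literal `\uXXXX` sequence.

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}")


def decode_escaped(items):
    """Return a new list with literal \\uXXXX sequences replaced by characters."""
    result = []
    for item in items:
        if ESCAPE_RE.search(item):
            # Replace each escape on its own, so characters that are already
            # non-ASCII elsewhere in the string are left untouched.
            item = ESCAPE_RE.sub(lambda m: chr(int(m.group(0)[2:], 16)), item)
        result.append(item)
    return result


names = ["H\\u00eatres et \\u00e9tang", "Índia", "plain"]
print(decode_escaped(names))  # ['Hêtres et étang', 'Índia', 'plain']
```

This sketch does not combine surrogate pairs, so characters outside the Basic Multilingual Plane would need extra handling.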
    
  • 2021-01-03 08:45

    "I want to convert that and store the strings in the list with their original names like below"

    When you serialise to JSON, there may be a flag that allows you to turn off the escaping of non-ASCII characters to \u sequences. If you are using the standard library json module, it's ensure_ascii:

    >>> import json
    >>> print json.dumps(u'Índia')
    "\u00cdndia"
    >>> print json.dumps(u'Índia', ensure_ascii=False)
    "Índia"
    

    However, be aware that with that safety measure taken away, you have to handle non-ASCII characters correctly yourself, or you'll get a bunch of UnicodeErrors. For example, if you are writing the JSON to a file, you must explicitly encode the Unicode string to the charset you want (for example UTF-8).

    j = json.dumps(u'Índia', ensure_ascii=False)
    # Encode to UTF-8 bytes explicitly; the with block closes the file.
    with open('file.json', 'wb') as f:
        f.write(j.encode('utf-8'))
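
For readers on Python 3, a rough equivalent is sketched below. The dictionary `data` and the temporary file path are assumptions made for the illustration. It also shows that the escaped and unescaped forms parse back to the same value, so `\u` sequences inside a JSON file are harmless as long as the file is read with the `json` module.

```python
import json
import os
import tempfile

data = {"name": "Índia"}

escaped = json.dumps(data)                        # non-ASCII written as \uXXXX
readable = json.dumps(data, ensure_ascii=False)   # non-ASCII kept as-is

print(escaped)   # {"name": "\u00cdndia"}
print(readable)  # {"name": "Índia"}

# Write the readable form to a throwaway file, naming the encoding explicitly
# instead of relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "file.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(readable)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Both serialised forms decode to the same Python object.
print(loaded == data, json.loads(escaped) == data)  # True True
```
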
    