I am working on Scrapy. I scraped some sites and stored the items from the scraped pages into JSON files, but some of them contain the following format.
You have byte strings containing Unicode escapes. You can convert them to Unicode with the unicode_escape codec:
>>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
Hêtres et étang
And you can encode it back to byte strings:
>>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
>>> s.encode("latin1")
'H\xeatres et \xe9tang'
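The examples above are Python 2 (`str.decode` on byte strings). As a sketch of the same round-trip in Python 3, where the escaped data has to be a `bytes` object and the variable names are mine:

```python
# Byte string holding literal \uXXXX escape sequences (note the
# doubled backslashes: the bytes contain a real backslash).
raw = b"H\\u00eatres et \\u00e9tang"

# unicode_escape turns the literal \u00ea / \u00e9 into real characters.
text = raw.decode("unicode_escape")   # 'Hêtres et étang'

# Encoding back to latin1 reproduces the \xea / \xe9 byte values.
back = text.encode("latin1")          # b'H\xeatres et \xe9tang'

print(text)
```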
You can filter and decode the non-unicode strings like:
for s in l:
    if not isinstance(s, unicode):
        print s.decode('unicode_escape')
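In Python 3 terms the same filtering idea looks like the sketch below: values that are still raw `bytes` get decoded, values that are already `str` pass through unchanged. The helper name `normalize_items` is mine, not from the original answer.

```python
def normalize_items(items):
    """Decode any byte strings with literal \\uXXXX escapes; keep str as-is."""
    out = []
    for s in items:
        if isinstance(s, bytes):            # Python 3 analogue of "not unicode"
            s = s.decode("unicode_escape")  # turn \u00ea into ê, etc.
        out.append(s)
    return out

print(normalize_items([b"H\\u00eatres", "étang"]))
```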
I want to convert that and store the strings in the list with their original names, like below.
When you serialise to JSON, there may be a flag that allows you to turn off the escaping of non-ASCII characters to \u sequences. If you are using the standard library json module, it's ensure_ascii:
>>> print json.dumps(u'Índia')
"\u00cdndia"
>>> print json.dumps(u'Índia', ensure_ascii=False)
"Índia"
However, be aware that with that safety measure taken away, you now have to deal with non-ASCII characters correctly yourself, or you'll get a bunch of UnicodeErrors. For example, if you are writing the JSON to a file, you must explicitly encode the Unicode string to the charset you want (for example UTF-8):
j = json.dumps(u'Índia', ensure_ascii=False)
open('file.json', 'wb').write(j.encode('utf-8'))
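In Python 3, `json.dumps(..., ensure_ascii=False)` returns a `str`, so a sketch of the same write opens the file in text mode with an explicit encoding instead of encoding by hand (the filename `file.json` follows the answer above):

```python
import json

# ensure_ascii=False keeps 'Í' as a real character instead of \u00cd.
j = json.dumps('Índia', ensure_ascii=False)

# Text mode with an explicit encoding replaces the manual .encode('utf-8').
with open('file.json', 'w', encoding='utf-8') as f:
    f.write(j)
```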