I'm using Python 2 to parse JSON from ASCII encoded text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects.
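For example, in Python 2 every JSON string comes back as unicode:

>>> import json
>>> json.loads('["a", 1]')
[u'a', 1]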
I've adapted the code from Mark Amery's answer, mainly to get rid of isinstance in favor of duck typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences
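To illustrate what the flag does, a quick Python 2 sketch (the dict is made up):

>>> import json
>>> json.dumps({'a': u'kígyó'})
'{"a": "k\\u00edgy\\u00f3"}'
>>> json.dumps({'a': u'kígyó'}, ensure_ascii=False)
u'{"a": "k\xedgy\xf3"}'

Note that with ensure_ascii=False the result is a unicode object, not a str.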
Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as "ascii", incorrectly I think; it depends on the codepage setting); cp1250, used e.g. in Windows (sometimes referred to as "ansi", dependent on the locale settings); and iso-8859-2, sometimes used on HTTP servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.
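These really are different byte representations of the same text; a quick check (Python 2):

>>> t = u"Tüskéshátú kígyóbűvölő"
>>> t.encode('cp852') == t.encode('cp1250')
False
>>> len(t.encode('utf-8')) > len(t.encode('cp1250'))  # utf-8 spends two bytes on each accented letter
True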
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json


def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>>
    >>> ## run this with `python -m doctest <this file>.py` from the command line
    >>>
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u, (txt3, txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a': txt2.encode('utf-8')} == encode_items({'a': txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a': txt2.encode('utf-8')}] == encode_items([{'a': txt}], encoding='utf-8')
    >>> assert {'b': {'a': txt2.encode('utf-8')}} == encode_items({'b': {'a': txt}}, encoding='utf-8')
    """
    try:
        # dicts quack with iteritems(); recurse into both keys and values
        input.iteritems
        return {encode_items(k): encode_items(v) for (k, v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            # other iterables (lists, tuples) are rebuilt as lists
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            # scalars (int, float, bool, None) pass through unchanged
            return input


def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
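A minimal usage sketch (the output file name is just an example): since encode_items has already turned every string into a UTF-8 byte string, alt_dumps returns a plain str that can be written to a file opened in binary mode:

with open('out.json', 'wb') as f:
    f.write(alt_dumps({'a': u"Tüskéshátú kígyóbűvölő"}))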
I'd also like to highlight Jarret Hardie's answer, which references the JSON spec, quoting:
A string is a collection of zero or more Unicode characters
In my use case I had files containing JSON. They are utf-8 encoded files. With ensure_ascii enabled the result is properly escaped but not very readable JSON, which is why I adapted Mark Amery's answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful to someone.
Check out this answer to a similar question, which states that
The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won't see a u.
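To see it for yourself (mail_accounts here is a made-up stand-in for the actual data):

>>> mail_accounts = [{"i": u"some text"}]
>>> mail_accounts[0]["i"]
u'some text'
>>> print mail_accounts[0]["i"]
some text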
This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.
def _parseJSON(self, obj):
    newobj = {}
    for key, value in obj.iteritems():
        key = str(key)
        if isinstance(value, dict):
            # recurse into nested objects
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            # note: assumes every list element is itself a dict
            if key not in newobj:
                newobj[key] = []
            for i in value:
                newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                # cast strings of pure digits to int
                val = int(val)
            else:
                try:
                    # otherwise try float, falling back to a plain str
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val
    return newobj
Just pass it a JSON object like so:
obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)
I have it as a private member of a class, but you can repurpose the method as you see fit.
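For example (input values invented for illustration), numeric-looking strings come back as numbers:

obj = json.loads('{"a": "1", "b": "2.5", "c": "x"}')
obj = _parseJSON(obj)
# obj is now {'a': 1, 'b': 2.5, 'c': 'x'}: digit strings become int,
# decimal strings become float, everything else stays a str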