I'm using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects.
There exists an easy work-around.
TL;DR - Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.
While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7:
import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))
gives:
JSON Fail: {u'field': u'value'}
AST Win: {'field': 'value'}
This gets hairier when some values really are Unicode strings, and the full answer gets complicated quickly.
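One more limit worth knowing before relying on this: ast.literal_eval() parses Python literals, not JSON, so documents containing true, false or null are rejected. A minimal sketch of that check follows; the helper name parses_as_python_literal is made up for illustration and is not part of ast or json.

```python
import ast
import json


def parses_as_python_literal(text):
    """Hypothetical helper: True if ast.literal_eval() accepts the text."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


# Plain strings and numbers look the same in JSON and in Python.
plain = json.dumps({'field': 'value'})

# Booleans and None are serialized as true and null, which are not Python literals.
tricky = json.dumps({'ok': True, 'missing': None})

plain_ok = parses_as_python_literal(plain)
tricky_ok = parses_as_python_literal(tricky)
```

Here plain_ok is True and tricky_ok is False, so the trick only holds for files that you know contain nothing but strings, numbers, lists and objects.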
Just use pickle instead of json for dump and load, like so:
import json
import pickle
d = {'field1': 'value1', 'field2': 2}
with open("testjson.txt", "w") as f:
    json.dump(d, f)
with open("testjson.txt", "r") as f:
    print json.load(f)
# pickle data is binary, so open the files in binary mode
with open("testpickle.txt", "wb") as f:
    pickle.dump(d, f)
with open("testpickle.txt", "rb") as f:
    print pickle.load(f)
The output it produces is (strings and integers are handled correctly):
{u'field2': 2, u'field1': u'value1'}
{'field2': 2, 'field1': 'value1'}
I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)
obj is the object decoded from JSON:
if cls in NAME_CLASS_MAP:
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()
kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs).
Not as robust as the solution from Wells, but much smaller.
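For anyone who wants to try the key conversion outside GAE, here is a self-contained sketch of the same idea. The names stringify_keys and make_point are hypothetical stand-ins (make_point plays the role of the model constructor), and only the top-level keys are converted, exactly as in the loop above.

```python
import json


def stringify_keys(obj):
    """Return a shallow copy of a decoded JSON object with str() applied to each key."""
    return dict((str(key), value) for key, value in obj.items())


def make_point(x, y):
    """Hypothetical stand-in for a constructor that takes keyword arguments."""
    return (x, y)


decoded = json.loads('{"x": 1, "y": 2}')

# The keys are plain str objects now, so they are safe to splat as **kwargs.
point = make_point(**stringify_keys(decoded))
```

Values and nested objects are left alone, which is why this stays so small.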
I rewrote Wells's _parse_json() to handle cases where the json object itself is an array (my use case).
def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj
Mike Brennan's answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:
object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.
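To make the quoted description concrete, here is a small sketch that records what the decoder hands to the hook. The names recording_hook and calls are invented for this illustration; the only real API used is the object_pairs_hook parameter of json.loads.

```python
import json
from collections import OrderedDict

calls = []  # every list of pairs the decoder passes to the hook, in call order


def recording_hook(pairs):
    """Hypothetical hook: remember the pairs, then build an ordered mapping."""
    calls.append(list(pairs))
    return OrderedDict(pairs)


result = json.loads('{"b": 1, "a": {"z": 2, "y": 3}}',
                    object_pairs_hook=recording_hook)
```

The inner object reaches the hook first, as the pairs for z and y, and the outer object follows with the already converted inner value in place. Key order matches the document in both.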
With this hook, every JSON object is handed to you as it is decoded, so you can do the conversion without writing your own recursion:
def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)
In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'
In [53]: json.load(open('test.json'))
Out[53]:
{u'1': u'hello',
u'abc': [1, 2, 3],
u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
u'def': {u'hi': u'mom'}}
In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, u'hi', u'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Notice that I never have to call the hook recursively, since every object gets handed to the hook when you use object_pairs_hook. You do have to care about lists: an object within a list is converted without any recursion on your part, but the hook as written leaves strings that sit directly in a list (u'hi' and u'moo' above) untouched, because it only converts values that are themselves strings.
EDIT: A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:
for key, value in pairs:
to
for key, value in pairs.iteritems():
Then use object_hook instead of object_pairs_hook:
In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, u'hi', u'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Using object_pairs_hook results in one fewer dictionary being instantiated for each object in the JSON document, which might be worthwhile if you are parsing a huge document.
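On the point about lists: a hook only fires for JSON objects, so strings that sit directly inside a list need a little extra help. The following is a sketch of one possible variant, not the code from the answer above. The names _encode and deunicodify_hook_with_lists are made up, and text_type is a shim I added so the snippet also runs on Python 3, where the converted values come out as bytes.

```python
import json

text_type = type(u"")  # unicode on Python 2, str on Python 3 (assumption for portability)


def _encode(value):
    """Encode text to UTF-8 and walk into lists; leave everything else alone."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    if isinstance(value, list):
        return [_encode(item) for item in value]
    # Dicts were already converted when the hook ran on them, so they pass through.
    return value


def deunicodify_hook_with_lists(pairs):
    """Hypothetical variant of the hook that also converts strings inside lists."""
    return dict((_encode(key), _encode(value)) for key, value in pairs)


doc = '{"boo": [1, "hi", ["moo"], {"5": "some"}]}'
result = json.loads(doc, object_pairs_hook=deunicodify_hook_with_lists)
```

A document whose top level is an array never triggers the hook for that array, so its direct string elements would still need a separate pass.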
While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML it works nicely:
>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']
Some things to note though:
I get string objects because all my entries are ASCII encoded. If I used Unicode-encoded entries, I would get them back as unicode objects; there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.
As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:
I used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.
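For readers who don't have that answer at hand, a conversion function in that spirit could look roughly like the sketch below. It is my own approximation, not Mark Amery's exact code. The name byteify is used for illustration, and text_type is a shim so the snippet runs on Python 2 and 3 alike (on Python 3 the results are bytes).

```python
import json

text_type = type(u"")  # unicode on Python 2, str on Python 3 (assumption for portability)


def byteify(data):
    """Recursively encode every text string in a decoded JSON value as UTF-8."""
    if isinstance(data, text_type):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [byteify(item) for item in data]
    if isinstance(data, dict):
        return dict((byteify(key), byteify(value)) for key, value in data.items())
    # Numbers, booleans and None need no conversion.
    return data


# \u00eb is a non-ASCII character, so str() would fail on it under Python 2.
decoded = byteify(json.loads('{"name": "Zo\\u00eb", "tags": ["a", "b"]}'))
```

Because the text is encoded explicitly as UTF-8, non-ASCII values survive, at the cost of walking the whole structure a second time after parsing.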