I'm using Python 2 to parse JSON from ASCII encoded text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects rather than string objects.
There's no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:
def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
Just call this on the output you get from a json.load or json.loads call.
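For example, a quick sketch of typical usage (the sample JSON text here is made up):

import json

# decode the JSON text to unicode objects, then convert the whole
# result to UTF-8-encoded byte strings in one pass
data = byteify(json.loads('{"greeting": "hell\\u00f6", "nums": [1, 2]}'))
print data  # {'greeting': 'hell\xc3\xb6', 'nums': [1, 2]} (key order may vary)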
A couple of notes:

- To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
- This function recurses through the whole decoded object, which has some performance costs that can be avoided with careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.

Support Python 2 & 3 using a hook (from https://stackoverflow.com/a/33571117/558397):
import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable

r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a string, return its native str representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [_byteify(item) for item in data]
    # if this is a dictionary, return dictionary of byteified keys and values
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)
Returns:
{'three': '', 'key': 'value', 'one': 'two'}
The gotcha is that simplejson and json are two different modules, at least in the manner they deal with unicode. You have json in py 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.
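A quick check you can run to see which string type each module hands back (whether simplejson returns str depends on the version installed; newer releases return unicode just like json):

import json
import simplejson

# json in Python 2 always decodes JSON strings to unicode
print type(json.loads('["abc"]')[0])        # <type 'unicode'>

# whether simplejson returns str or unicode depends on its version
print type(simplejson.loads('["abc"]')[0])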
Here is a recursive encoder written in C: https://github.com/axiros/nested_encode
Performance overhead for "average" structures is around 10% compared to json.loads alone:

python speed.py
json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
time overhead in percent: 9%

Using this test structure:
import json, nested_encode, time

s = """
{
    "firstName": "Jos\\u0301",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "\\u00d6sterreich",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        }
    ],
    "children": [],
    "spouse": null,
    "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""

t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])
print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json)/dt_json)
A solution using object_hook:
import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts=False):
    # if this is a unicode string, return its UTF-8-encoded byte string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [_byteify(item, ignore_dicts=True) for item in data]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

- A copy of the entire decoded structure gets created in memory
- If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
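To see the decoding order this relies on, here's a tiny sketch (the JSON text is made up) showing that object_hook is called for the innermost dicts first:

import json

def show(d):
    # print each dict as the decoder hands it to us
    print 'object_hook got:', d
    return d

json.loads('{"outer": {"inner": {"x": 1}}}', object_hook=show)
# object_hook got: {u'x': 1}
# object_hook got: {u'inner': {u'x': 1}}
# object_hook got: {u'outer': {u'inner': {u'x': 1}}}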
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts, since they have already been byteified.
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will always pass the object_hook dicts only, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers like Wells shows. If it's a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.
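A quick illustration of why str() is risky on non-ASCII unicode (a minimal sketch):

>>> str(u'\xd6sterreich')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
>>> u'\xd6sterreich'.encode('utf-8')
'\xc3\x96sterreich'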
So, for example:
def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)
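For example, with a hypothetical document (note that object_hook only fires for JSON objects, so a top-level list or bare string would come back unconverted):

import json

s = '{"name": "\\u00d6sterreich", "tags": ["a", "b"]}'
obj = json.loads(s, object_hook=_decode_dict)
print obj  # {'name': '\xc3\x96sterreich', 'tags': ['a', 'b']} (key order may vary)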