I have a set of arbitrary JSON data that has been parsed in Python into nested lists and dicts of varying depth. I need to be able to 'flatten' this into a list of dicts.
Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject that will later require recursion. The second, flatten, iterates over a list of objects, makes the recursive calls, and takes care of reconstructing the final object for each iteration.
def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None
def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub
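
To make the 3-tuple returned by splitObj concrete, here is a minimal sketch that calls it on a single record. The function is repeated under the hypothetical name split_obj only so that the snippet runs on its own; iterating over a copy of the items is a defensive adjustment on my part, and the record is made up for illustration.

```python
def split_obj(obj, prefix=None):
    # Same logic as splitObj above, repeated so this sketch is self-contained.
    if prefix is None:
        new = dict(obj)
    else:
        new = {'{}_{}'.format(prefix, k): v for k, v in obj.items()}
    # Iterate over a snapshot so deleting a key cannot disturb the loop.
    for k, v in list(new.items()):
        if isinstance(v, list):
            del new[k]
            return new, k, v
        if isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None


# Hypothetical record with one nested subobject.
record = {'name': 'conglomerates', 'industry': {'id': '210', 'name': 'A'}}

# First level: the nested dict is pulled out and wrapped in a list.
flat, key, subs = split_obj(record)
print(flat)  # {'name': 'conglomerates'}
print(key)   # industry
print(subs)  # [{'id': '210', 'name': 'A'}]

# Second level: the key found above becomes the prefix for the subobject.
flat2, key2, subs2 = split_obj(subs[0], key)
print(flat2)  # {'industry_id': '210', 'industry_name': 'A'}
print(key2)   # None
```

The second call is what flatten does on each recursive step: the key that held the subobject is passed down as the prefix, which is how names like industry_id come about.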
Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:
>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> from pprint import pprint
>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
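
One limitation worth knowing: splitObj returns as soon as it finds the first nested list or dict, so if an object has two nested keys on the same level, the second one is carried along unflattened in the result. A possible way around this, sketched below under the hypothetical names split_obj and flatten_all (neither is part of the code above), is to feed each merged record back through the flattener until nothing nested remains. The sketch assumes Python 3.3+ for `yield from`, and the sample data is made up.

```python
def split_obj(obj, prefix=None):
    # Same logic as splitObj, repeated so this sketch is self-contained.
    if prefix is None:
        new = dict(obj)
    else:
        new = {'{}_{}'.format(prefix, k): v for k, v in obj.items()}
    for k, v in list(new.items()):
        if isinstance(v, list):
            del new[k]
            return new, k, v
        if isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None


def flatten_all(data, prefix=None):
    # Hypothetical variant of flatten that keeps going until no nested
    # list or dict is left in any yielded record.
    for item in data:
        flat, key, subs = split_obj(item, prefix)
        if key is None:
            yield flat
            continue
        for sub in flatten_all(subs, key):
            sub.update(flat)
            # flat may still hold other nested values, so run the merged
            # record through the flattener again (its keys are already final,
            # so no prefix is passed).
            yield from flatten_all([sub])


# Made-up record with two nested keys on the same level.
data = [{'name': 'n', 'a': [{'x': 1}, {'x': 2}], 'b': {'y': 3}}]
result = list(flatten_all(data))
print(result)
```

With this input, both records end up with the keys name, a_x and b_y. Note that lists under sibling keys are combined as a cross product, which may or may not be what you want for your data.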