I have a set of arbitrary JSON data that has been parsed in Python to lists of dicts and lists of varying depth. I need to be able to 'flatten' this into a list of dicts. Example below:
Source Data Example 1
[{u'industry': [
      {u'id': u'112', u'name': u'A'},
      {u'id': u'132', u'name': u'B'},
      {u'id': u'110', u'name': u'C'},
  ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]
Desired Result Example 1
[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]
This is easy enough for this simple example, but I don't always have this exact structure of a list of dicts with one additional layer of lists of dicts. In some cases, I may have additional nesting that needs to follow the same methodology. As a result, I think I will need recursion, and I cannot seem to get it to work.
Proposed Methodology
1) For each list of dicts, prefix each key with a 'path' that gives the name of the parent key. In the example above, 'industry' was the key that contained a list of dicts, so each key of the child dicts in that list gets 'industry' prepended (for example, 'id' becomes 'industry_id').
2) Add the 'parent' items to each dict within the list. In this case, 'name' and 'industry' were the items in the top-level list of dicts, so the 'name' key/value pair was added to each of the items in 'industry'.
I can imagine some scenarios where you had multiple lists of dicts or even dicts of dicts in the 'Parent' items and applying each of these sub-trees to the children list of dicts would not work. As a result, I'll assume that the 'parent' items are always simple key/value pairs.
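As a rough illustration of the two steps above for a single level of nesting, here is a minimal sketch. The helper name flatten_one_level and its child_key parameter are made up for this illustration, and it assumes, as stated above, that every parent item other than child_key is a simple key/value pair.

```python
def flatten_one_level(record, child_key):
    """Hypothetical helper: flatten one dict that holds a list (or a single
    dict) of children under child_key. It does not recurse."""
    # Step 2 input: the simple key/value pairs of the parent
    parent = {k: v for k, v in record.items() if k != child_key}

    children = record[child_key]
    if isinstance(children, dict):
        # Treat a single child dict like a one-element list
        children = [children]

    rows = []
    for child in children:
        # Step 1: prefix every child key with the parent key ('path')
        row = {'{}_{}'.format(child_key, k): v for k, v in child.items()}
        # Step 2: add the parent items to the child row
        row.update(parent)
        rows.append(row)
    return rows
```

Because it does not recurse, this only covers the shape of Example 1; deeper nesting such as Example 2 needs the same two steps applied at every level.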
One more example to try to illustrate the potential variabilities in data structure that need to be handled.
Source Data Example 2
[{u'industry': [
      {u'id': u'112', u'name': u'A'},
      {u'id': u'132', u'name': u'B'},
      {u'id': u'110', u'name': u'C', u'company': [
          {u'id':'500', u'symbol':'X'},
          {u'id':'502', u'symbol':'Y'},
          {u'id':'504', u'symbol':'Z'},
        ]
      },
  ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]
Desired Result Example 2
[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C',
  u'company_id':'500', u'company_symbol':'X'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C',
  u'company_id':'502', u'company_symbol':'Y'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C',
  u'company_id':'504', u'company_symbol':'Z'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]
I have looked at several other examples and I can't seem to find one that works in these example cases.
Any suggestions or pointers? I've spent some time trying to build a recursive function to handle this with no luck after many hours...
UPDATED WITH ONE FAILED ATTEMPT
def _flatten(sub_tree, flattened=[], path="", parent_dict={}, child_dict={}):
    if type(sub_tree) is list:
        for i in sub_tree:
            flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=path,
                                      parent_dict=parent_dict,
                                      child_dict=child_dict
                                      )
                             )
        return flattened
    elif type(sub_tree) is dict:
        lists = {}
        new_parent_dict = {}
        new_child_dict = {}
        for key, value in sub_tree.items():
            new_path = path + '_' + key
            if type(value) is dict:
                for key2, value2 in value.items():
                    new_path2 = new_path + '_' + key2
                    new_parent_dict[new_path2] = value2
            elif type(value) is unicode:
                new_parent_dict[key] = value
            elif type(value) is list:
                lists[new_path] = value
        new_parent_dict.update(parent_dict)
        for key, value in lists.items():
            for i in value:
                flattened.append(_flatten(i,
                                          flattened=flattened,
                                          path=key,
                                          parent_dict=new_parent_dict,
                                          )
                                 )
        return flattened
The result I get is a 231x231 matrix of 'None' objects - clearly I'm getting into trouble with the recursion running away.
I've tried a few additional 'start from scratch' attempts and failed with a similar failure mode.
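Two general Python behaviours are worth keeping in mind when reading the attempt above, although neither has been verified as the sole cause of the result described: a mutable default argument such as flattened=[] is created once and shared across calls, and appending the return value of a function that returns the accumulator itself puts the list inside itself. The small sketch below demonstrates both with made-up names (collect, collect_safe, acc) that are not part of the original code.

```python
def collect(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits bucket shares the same list.
    bucket.append(item)
    return bucket

first = collect('a')
second = collect('b')        # reuses the list from the first call
shared = first is second     # both names point at one list

acc = []
acc.append(acc)              # appending the accumulator to itself
self_ref = acc[0] is acc     # the list now contains itself


def collect_safe(item, bucket=None):
    # Common idiom: use None as the default and build a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Note also that a Python function that falls off the end without reaching a return statement returns None, which is what _flatten does when sub_tree is neither a list nor a dict.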
Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject that will later require the recursion. The second, flatten, actually iterates over a list of objects, makes the recursive calls, and takes care of reconstructing the final object for each iteration.
def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None
def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)
        # just return fully flat objects
        if key is None:
            yield flat
            continue
        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub
Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:
>>> from pprint import pprint
>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
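If you would rather have the shorter keys from the question (company_id instead of industry_company_id), one possible variation is to pass only the immediate parent key down as the prefix instead of the accumulated path. The sketch below is a self-contained rewrite under that assumption; the names split_obj and flatten_short are made up here and are not the functions above.

```python
def split_obj(obj, prefix=None):
    """Return (flat, sub_key, subs); sub_key is the *unprefixed* key of the
    first nested list or dict found, or None if the object is already flat."""
    flat, sub_key, subs = {}, None, None
    for k, v in obj.items():
        if sub_key is None and isinstance(v, (list, dict)):
            # Remember the first nested container; wrap a lone dict in a list
            sub_key = k
            subs = v if isinstance(v, list) else [v]
        else:
            # Everything else is copied, prefixed with the immediate parent key
            name = k if prefix is None else '{}_{}'.format(prefix, k)
            flat[name] = v
    return flat, sub_key, subs


def flatten_short(data, prefix=None):
    """Flatten like the generator above, but prefix only with the direct parent key."""
    for item in data:
        flat, key, subs = split_obj(item, prefix)
        if key is None:
            yield flat
            continue
        for sub in flatten_short(subs, key):
            # Parent values win if a key collides
            sub.update(flat)
            yield sub
```

The trade-off is that short prefixes can collide when the same key name appears under the same parent name at different depths, in which case sub.update(flat) silently overwrites the child value. As with the original, only the first nested list or dict in each object is expanded.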
Source: https://stackoverflow.com/questions/21512957/flattening-generic-json-list-of-dicts-or-lists-in-python