Flattening Generic JSON List of Dicts or Lists in Python

Question

I have a set of arbitrary JSON data that has been parsed in Python to lists of dicts and lists of varying depth. I need to be able to 'flatten' this into a list of dicts. Example below:

Source Data Example 1

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C'},
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 1

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

This is easy enough for this simple example, but I don't always have this exact structure of a list of dicts with one additional layer of list of dicts. In some cases, I may have additional nesting that needs to follow the same methodology. As a result, I think I will need recursion, and I cannot seem to get this to work.

Proposed Methodology

1) For each list of dicts, prepend each key with a 'path' that gives the name of the parent key. In the example above, 'industry' was the key that contained a list of dicts, so each child dict in the list has 'industry' prepended to its keys.

2) Add 'Parent' Items to Each Dict within List - in this case, the 'name' and 'industry' were the items in the top level list of dicts, and so the 'name' key/value was added to each of the items in 'industry'.

I can imagine some scenarios where you had multiple lists of dicts or even dicts of dicts in the 'Parent' items and applying each of these sub-trees to the children list of dicts would not work. As a result, I'll assume that the 'parent' items are always simple key/value pairs.
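For a single level of nesting, the two steps above can be sketched roughly as follows. The helper names prefix_keys and attach_parent are illustrative only (they do not come from any library), and the sketch assumes, as stated above, that the parent items are simple key/value pairs.

```python
def prefix_keys(child, parent_key):
    """Step 1: prepend the parent key to every key of a child dict."""
    return {'{}_{}'.format(parent_key, k): v for k, v in child.items()}


def attach_parent(child, parent_items):
    """Step 2: copy the parent's simple key/value pairs into the child."""
    merged = dict(child)
    merged.update(parent_items)
    return merged


parent = {'name': 'materials',
          'industry': [{'id': '112', 'name': 'A'},
                       {'id': '132', 'name': 'B'}]}

# Keep only the parent's simple (non-container) items
simple = {k: v for k, v in parent.items() if not isinstance(v, (list, dict))}

rows = [attach_parent(prefix_keys(child, 'industry'), simple)
        for child in parent['industry']]
```

This handles exactly one layer; deeper nesting is where the recursion becomes necessary.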

One more example to illustrate the variability in data structure that needs to be handled.

Source Data Example 2

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C', u'company': [
                            {u'id':'500', u'symbol':'X'},
                            {u'id':'502', u'symbol':'Y'},
                            {u'id':'504', u'symbol':'Z'},
                  ]
   },
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 2

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'500', u'company_symbol':'X'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'502', u'company_symbol':'Y'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'504', u'company_symbol':'Z'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

I have looked at several other examples and I can't seem to find one that works in these example cases.

Any suggestions or pointers? I've spent some time trying to build a recursive function to handle this with no luck after many hours...

UPDATED WITH ONE FAILED ATTEMPT

def _flatten(sub_tree, flattened=[], path="", parent_dict={}, child_dict={}):
    if type(sub_tree) is list:
        for i in sub_tree:
            flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=path,
                                      parent_dict=parent_dict,
                                      child_dict=child_dict
                                      )
                            )
        return flattened
    elif type(sub_tree) is dict:
        lists = {}
        new_parent_dict = {}
        new_child_dict = {}
        for key, value in sub_tree.items():
            new_path = path + '_' + key
            if type(value) is dict:
                for key2, value2 in value.items():
                    new_path2 = new_path + '_' + key2
                    new_parent_dict[new_path2] = value2
            elif type(value) is unicode:
                new_parent_dict[key] = value
            elif type(value) is list:
                lists[new_path] = value
        new_parent_dict.update(parent_dict)
        for key, value in lists.items():
            for i in value:
                flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=key,
                                      parent_dict=new_parent_dict,
                                      )
            )
        return flattened

The result I get is a 231x231 matrix of 'None' objects - clearly I'm getting into trouble with the recursion running away.

I've tried a few additional 'start from scratch' attempts and failed with a similar failure mode.
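Two things can be read directly off the attempt above, offered as a reading of the code rather than a confirmed diagnosis: the defaults flattened=[], parent_dict={} and child_dict={} are mutable objects that Python creates once and shares between calls, and every flattened.append(...) appends the return value of the recursive call (the accumulator itself, or None for anything that is neither a list nor a dict) instead of a flattened dict. A minimal sketch with hypothetical names (collect, collect_fixed) that reproduces both effects on a tiny input and shows one common way around them:

```python
def collect(tree, out=[]):          # the default list is created once and shared
    if isinstance(tree, list):
        for item in tree:
            out.append(collect(item, out=out))   # appends the call's return value
        return out
    # no explicit return for leaves, so the call evaluates to None


first = collect([1, 2])
second = collect([3])               # reuses the same default list as the first call


def collect_fixed(tree, out=None):
    if out is None:                 # create a fresh accumulator per top-level call
        out = []
    if isinstance(tree, list):
        for item in tree:
            collect_fixed(item, out)   # recurse for the side effect only
    else:
        out.append(tree)               # append the leaf itself
    return out
```

Here first and second end up being the same list filled with None values, while collect_fixed returns the leaves and starts from an empty list on every top-level call.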

Answer 1:

Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject which will later require the recursion. The second, flatten, actually iterates over a list of objects, makes the recursive calls and takes care of reconstructing the final object for each iteration.

def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub

Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:

>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]

>>> from pprint import pprint
>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
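
If the shorter keys from the question's desired result are preferred (company_id rather than industry_company_id), one possible variant is to prefix each key with its immediate parent key only. The following is a sketch under that assumption, not part of the original answer; split_obj and flatten_short are illustrative names, and like the functions above it only descends into the first nested list or dict it finds in each object.

```python
def split_obj(obj, prefix=None):
    """Return (flat part, key of the first nested value, list of sub-objects)."""
    flat, key, subs = {}, None, None
    for k, v in obj.items():
        if key is None and isinstance(v, (list, dict)):
            # remember the first nested value; wrap a single dict in a list
            key, subs = k, (v if isinstance(v, list) else [v])
        else:
            flat[k if prefix is None else '{}_{}'.format(prefix, k)] = v
    return flat, key, subs


def flatten_short(data, prefix=None):
    """Flatten data, prefixing keys with the immediate parent key only."""
    for item in data:
        flat, key, subs = split_obj(item, prefix)
        if key is None:
            yield flat
            continue
        # key is passed on unprefixed, so the path does not accumulate
        for sub in flatten_short(subs, key):
            sub.update(flat)
            yield sub


data = [{'industry': [{'id': '112', 'name': 'A'},
                      {'id': '110', 'name': 'C',
                       'company': [{'id': '500', 'symbol': 'X'},
                                   {'id': '502', 'symbol': 'Y'}]}],
         'name': 'materials'},
        {'industry': {'id': '210', 'name': 'A'}, 'name': 'conglomerates'}]

rows = list(flatten_short(data))
```

The trade-off is that short prefixes can collide: if the same key name appears under parents with the same name at different depths, sub.update(flat) lets the outer value overwrite the inner one, which the full-path prefix avoids.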

Source: https://stackoverflow.com/questions/21512957/flattening-generic-json-list-of-dicts-or-lists-in-python

Tags

python

json

list

dictionary

recursive-datastructures