pandas.io.json.json_normalize with very nested json

后端 未结 2 864
野性不改
野性不改 2020-11-30 00:04

I have been trying to normalize a very nested json file I will later analyze. What I am struggling with is how to go more than one level deep to normalize.

相关标签:
2条回答
  • 2020-11-30 00:31

    You can also have a look at the library flatten_json, which does not require you to write column hierarchies as in json_normalize:

    from flatten_json import flatten
    
    data = d['hits']['hits']
    dict_flattened = (flatten(record, '.') for record in data)
    df = pd.DataFrame(dict_flattened)
    print(df)
    

    See https://github.com/amirziai/flatten.

    0 讨论(0)
  • 2020-11-30 00:46

    In the pandas example (below) what do the brackets mean? Is there a logic to be followed to go deeper with the []. [...]

    result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
    

    Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include, in addition to the selected rows. The second argument json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select elements from the input data structure that make up the rows in the output, and the meta paths adds further metadata that will be included with each of the rows. Think of these as table joins in a database, if you will.

    The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dicts:

    >>> data = [{'state': 'Florida',
    ...          'shortname': 'FL',
    ...         'info': {'governor': 'Rick Scott'},
    ...         'counties': [{'name': 'Dade', 'population': 12345},
    ...                      {'name': 'Broward', 'population': 40000},
    ...                      {'name': 'Palm Beach', 'population': 60000}]},
    ...         {'state': 'Ohio',
    ...          'shortname': 'OH',
    ...          'info': {'governor': 'John Kasich'},
    ...          'counties': [{'name': 'Summit', 'population': 1234},
    ...                       {'name': 'Cuyahoga', 'population': 1337}]}]
    >>> pprint(data[0]['counties'])
    [{'name': 'Dade', 'population': 12345},
     {'name': 'Broward', 'population': 40000},
     {'name': 'Palm Beach', 'population': 60000}]
    >>> pprint(data[1]['counties'])
    [{'name': 'Summit', 'population': 1234},
     {'name': 'Cuyahoga', 'population': 1337}]
    

    Between them there are 5 rows of data to use in the output:

    >>> json_normalize(data, 'counties')
             name  population
    0        Dade       12345
    1     Broward       40000
    2  Palm Beach       60000
    3      Summit        1234
    4    Cuyahoga        1337
    

    The meta argument then names some elements that live next to those counties lists, and those are then merged in separately. The values from the first data[0] dictionary for those meta elements are ('Florida', 'FL', 'Rick Scott'), respectively, and for data[1] the values are ('Ohio', 'OH', 'John Kasich'), so you see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively:

    >>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
    ('Florida', 'FL', 'Rick Scott')
    >>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
    ('Ohio', 'OH', 'John Kasich')
    >>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
             name  population    state shortname info.governor
    0        Dade       12345  Florida        FL    Rick Scott
    1     Broward       40000  Florida        FL    Rick Scott
    2  Palm Beach       60000  Florida        FL    Rick Scott
    3      Summit        1234     Ohio        OH   John Kasich
    4    Cuyahoga        1337     Ohio        OH   John Kasich
    

    So, if you pass in a list for the meta argument, then each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.

    In your example JSON, there are only a few nested lists to elevate with the first argument, like 'counties' did in the example. The only example in that datastructure is the nested 'authors' key; you'd have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.

    The second meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] nested paths.

    The record_path argument takes the authors lists as the starting point, these look like:

    >>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
    >>> d['hits']['hits'][1]['_source']['authors']
    [{'affiliations': ['Punjabi University'],
      'author_id': '780E3459',
      'author_name': 'munish puri'},
     {'affiliations': ['Punjabi University'],
      'author_id': '48D92C79',
      'author_name': 'rajesh dhaliwal'},
     {'affiliations': ['Punjabi University'],
      'author_id': '7D9BD37C',
      'author_name': 'r s singh'}]
    >>> d['hits']['hits'][2]['_source']['authors']
    [{'author_id': '7FF872BC',
      'author_name': 'barbara eileen ryan'}]
    >>> # etc.
    

    and so gives you the following rows:

    >>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
               affiliations author_id          author_name
    0  [Punjabi University]  780E3459          munish puri
    1  [Punjabi University]  48D92C79      rajesh dhaliwal
    2  [Punjabi University]  7D9BD37C            r s singh
    3                   NaN  7FF872BC  barbara eileen ryan
    4                   NaN  0299B8E9     fraser j harbutt
    5                   NaN  7DAB7B72   richard m freeland
    

    and then we can use the third meta argument to add more columns like _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

    >>> json_normalize(
    ...     data['hits']['hits'],
    ...     ['_source', 'authors'],
    ...     ['_id', ['_source', 'journal'], ['_source', 'title']]
    ... )
               affiliations author_id          author_name       _id   \
    0  [Punjabi University]  780E3459          munish puri  7AF8EBC3  
    1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
    2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
    3                   NaN  7FF872BC  barbara eileen ryan  7521A721
    4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
    5                   NaN  7DAB7B72   richard m freeland  7B3236C5
    
                                         _source.journal
    0  Journal of Industrial Microbiology & Biotechno...
    1  Journal of Industrial Microbiology & Biotechno...
    2  Journal of Industrial Microbiology & Biotechno...
    3                     The American Historical Review
    4                     The American Historical Review
    5                     The American Historical Review
    
                                           _source.title  \
    0  Development of a stable continuous flow immobi...
    1  Development of a stable continuous flow immobi...
    2  Development of a stable continuous flow immobi...
    3  Feminism and the women's movement : dynamics o...
    4  The iron curtain : Churchill, America, and the...
    5  The Truman Doctrine and the origins of McCarth...
    
    0 讨论(0)
提交回复
热议问题