Normalize a complex nested JSON file

离开以前 2021-01-28 04:10

I'm trying to normalize the JSON file below into 4 tables: "content", "Modules", "Images", and another table for everything else.

{
    "id": "0000050


        
1 Answer
  •  孤城傲影
    2021-01-28 04:38

    You could use the flatten_json function from https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10, as shown below, and then pass the result to pd.json_normalize:

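    import json

    import pandas as pd

    # Load the raw JSON document.
    with open('test.json') as json_file:
        data = json.load(json_file)

    def flatten_json(y):
        """Flatten nested dicts and lists into one dict with '_'-joined keys."""
        out = {}

        def flatten(x, name=''):
            if isinstance(x, dict):
                for key in x:
                    flatten(x[key], name + key + '_')
            elif isinstance(x, list):
                # List items are keyed by their position: 0, 1, 2, ...
                for i, item in enumerate(x):
                    flatten(item, name + str(i) + '_')
            else:
                # Leaf value: drop the trailing '_' from the accumulated key.
                out[name[:-1]] = x

        flatten(y)
        return out

    # Flatten the first "content" entry and load it as a one-row DataFrame.
    # It is named df because the selection steps below read from df.
    df = pd.json_normalize(flatten_json(data["content"][0]))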
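
    To make the key naming concrete, here is a minimal sketch that runs the same flattening logic on a tiny made-up record. The sample dict and its field names are assumptions for illustration only, not the asker's file:

```python
import pandas as pd


def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical record, invented for this example.
sample = {
    "id": "A1",
    "template": {
        "module": [
            {"id": "m-1", "image": [{"src": "a.png"}]},
            {"id": "m-2"},
        ]
    },
}

flat = flatten_json(sample)
# json_normalize turns the flat dict into a one-row DataFrame.
df_sample = pd.json_normalize(flat)
print(list(df_sample.columns))
# ['id', 'template_module_0_id', 'template_module_0_image_0_src', 'template_module_1_id']
```

    Each list position becomes part of the column name, which is why the later steps can pick out columns by substring.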
    

    Next, select the columns that belong to each of the four tables you described. The flattened DataFrame looks like this:

    ID_0  content_revision  ... locale_data_locale locale_data_identified_by
    0  B01     1580225050941  ...              en_US            MACHINE_DETECT
    

    Then select the columns for each table, for instance for the module and image DataFrames:

    module = df.loc[:,df.columns.str.contains("module")]
    image = df.loc[:,df.columns.str.contains("image")]
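
    The same substring masks can also produce the "everything else" table by negation. This is a rough sketch on a made-up one-row frame; the column names are assumptions modeled on the flattened output, not the real file:

```python
import pandas as pd

# Hypothetical flattened row, invented for this example.
df_demo = pd.DataFrame([{
    "id": "A1",
    "template_module_0_id": "m-1",
    "template_module_0_image_0_src": "a.png",
    "locale_data_locale": "en_US",
}])

# Boolean masks over the column names (plain substring match, no regex).
is_module = df_demo.columns.str.contains("module", regex=False)
is_image = df_demo.columns.str.contains("image", regex=False)

module_cols = df_demo.loc[:, is_module]
image_cols = df_demo.loc[:, is_image]
# Whatever matched neither mask goes to the catch-all table.
other_cols = df_demo.loc[:, ~(is_module | is_image)]

print(list(other_cols.columns))
# ['id', 'locale_data_locale']
```

    Note that an image nested inside a module matches both masks, so you may want to exclude image columns from the module table explicitly.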
    

    The result for module, for instance, is:

    template_module_0_id  ... template_module_1_product
    0            module-11  ...                      None
    

    Here is the transformation for the module DataFrame as an example. Since there are only two modules, you can strip the module index from the column names and concatenate the two frames:

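    # Strip only the module index; replacing a bare "_0" would also hit nested indices such as image_0.
    module1 = module.loc[:, module.columns.str.contains("module_0_", regex=False)]
    module1.columns = module1.columns.str.replace("module_0_", "module_", regex=False)
    module2 = module.loc[:, module.columns.str.contains("module_1_", regex=False)]
    module2.columns = module2.columns.str.replace("module_1_", "module_", regex=False)
    modules = pd.concat([module1, module2])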
    

    And you get:

     template_module_id  ... template_module_image_7_originalSrc
    0          module-11  ...                                 NaN
    0           module-6  ...                                None
    

    If you had many more elements, the other option would be to apply the flatten_json and json_normalize functions directly to the nested element you want.
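
    A minimal sketch of that option, assuming the modules live in a list under data["template"]["module"] (this path is a guess from the column names above, and the sample data is invented):

```python
import pandas as pd


def flatten_json(y):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical structure; the real key path may differ.
data_demo = {
    "template": {
        "module": [
            {"id": "module-11", "image": [{"src": "a.png"}]},
            {"id": "module-6", "product": None},
        ]
    }
}

# Flatten each module on its own, so every module becomes one row
# and no per-module renaming or concat is needed.
rows = [flatten_json(m) for m in data_demo["template"]["module"]]
modules_demo = pd.json_normalize(rows)
print(modules_demo["id"].tolist())
# ['module-11', 'module-6']
```

    This scales to any number of modules, because the module index never ends up in the column names.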
