I'm trying to normalize the JSON file below into four tables: "content", "Modules", "Images", and another table for everything else.
{
"id": "0000050
You could use the flatten_json function from https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10 to flatten the record, as follows, and then pass the result to json_normalize:
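import json

import pandas as pd

with open('test.json') as json_file:
    data = json.load(json_file)

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            # Dict: recurse into each value, extending the key prefix
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # List: recurse into each item, using its position as the key part
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Scalar: store it under the prefix, minus the trailing '_'
            out[name[:-1]] = x

    flatten(y)
    return out

flat = flatten_json(data["content"][0])
# Named df because the column selections below refer to df
df = pd.json_normalize(flat)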
What remains is to select the columns for each of the four categories you described. The flattened DataFrame looks like this:
ID_0 content_revision ... locale_data_locale locale_data_identified_by
0 B01 1580225050941 ... en_US MACHINE_DETECT
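To make the column naming concrete, here is a minimal sketch of what flatten_json produces on a tiny record. The `sample` dict is made up for illustration and is not the asker's data; the function is the same one as above, lightly tidied so the snippet runs on its own.

```python
def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical record, only to show how nested keys and list positions are joined
sample = {
    "id": "A1",
    "template": {"module": [{"id": "module-11"}, {"id": "module-6"}]},
}

flat = flatten_json(sample)
print(flat)
# {'id': 'A1', 'template_module_0_id': 'module-11', 'template_module_1_id': 'module-6'}
```

List positions end up inside the column names (`module_0`, `module_1`), which is why the selections that follow match on substrings of the column names.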
You can then select columns by name, for instance for the module and image DataFrames:
module = df.loc[:,df.columns.str.contains("module")]
image = df.loc[:,df.columns.str.contains("image")]
The result for module, for instance, is:
template_module_0_id ... template_module_1_product
0 module-11 ... None
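The same idea extends to all four tables, including the "everything else" one, by combining boolean masks. This is only a sketch: the one-row `df` below is a made-up stand-in for the flattened DataFrame, and the precedence (image first, then module, then content) is an assumption about how you want overlapping names handled. Drop the `& ~is_image` term if image columns nested in a module should stay with the module table.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the flattened df
df = pd.DataFrame([{
    "id": "A1",
    "content_revision": 1,
    "template_module_0_id": "module-11",
    "template_module_0_image_0_src": "a.png",
    "locale_data_locale": "en_US",
}])

cols = df.columns

# Boolean masks over the column names; each column ends up in exactly one table
is_image = cols.str.contains("image")
is_module = cols.str.contains("module") & ~is_image
is_content = cols.str.contains("content") & ~is_module & ~is_image
is_other = ~(is_image | is_module | is_content)

image = df.loc[:, is_image]
module = df.loc[:, is_module]
content = df.loc[:, is_content]
other = df.loc[:, is_other]  # everything not matched above

print(list(other.columns))
```

Building the "everything else" mask as the negation of the other three means no column is silently lost when a new key shows up in the JSON.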
Next, as an example, here is how to reshape the module DataFrame. Since you only have two modules, you can rename the columns and concat the two pieces:
module1 = module.loc[:, module.columns.str.contains("module_0")]
# Strip only the module index, so indices elsewhere in the name are kept
module1.columns = module1.columns.str.replace("module_0", "module", regex=False)
module2 = module.loc[:, module.columns.str.contains("module_1")]
module2.columns = module2.columns.str.replace("module_1", "module", regex=False)
modules = pd.concat([module1, module2])
And you get:
template_module_id ... template_module_image_7_originalSrc
0 module-11 ... NaN
0 module-6 ... None
If you had many more elements, the other option would be to apply the flatten_json and json_normalize functions directly to the nested element you want.
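A rough sketch of that second option: flatten each nested element separately and let json_normalize stack the results, one row per element. The `data` dict and the `data["template"]["module"]` path are assumptions made for illustration; the real path in your file may differ. flatten_json is repeated here only so the snippet runs on its own.

```python
import pandas as pd


def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical structure; adjust the path to wherever the modules list lives
data = {
    "template": {
        "module": [
            {"id": "module-11", "image": [{"src": "a.png"}]},
            {"id": "module-6", "product": None},
        ]
    }
}

# One flat dict per module, so each module becomes one row
rows = [flatten_json(m) for m in data["template"]["module"]]
modules = pd.json_normalize(rows)
print(modules)
```

Because each module is flattened on its own, the module index never enters the column names, so no renaming or manual concat is needed however many modules there are.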