Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse.
The procedure I found useful for shallow nested JSON (a minimal end-to-end sketch of the whole flow follows the list):
1. ApplyMapping for the first level as datasource0.
2. Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode.
3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm).
4. Transform df1 back to a DynamicFrame, drop the intact columns with dataframe.drop_fields([itct1, itct2, ..., itctm]), and Relationalize the result.
5. Join the relationalized table with the intact table on the 'id' column.
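Below is a minimal sketch of those five steps as a Glue job script. The catalog database/table names, the S3 staging path, and the concrete columns (id, col1, coln, itct1) are placeholder assumptions standing in for the id/colN/itctM names above; substitute the ones from your crawled schema.

```
# Sketch only: names marked "assumption" below must be replaced with your own.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping, Relationalize
from pyspark.context import SparkContext
from pyspark.sql.functions import col, explode

sc = SparkContext()
glueContext = GlueContext(sc)

# Step 1: read the catalog table and map the first level of the JSON.
raw = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # assumption: your catalog database
    table_name="my_json_table")    # assumption: your catalog table
datasource0 = ApplyMapping.apply(
    frame=raw,
    mappings=[                      # type strings must match your schema
        ("id", "string", "id", "string"),
        ("col1", "string", "col1", "string"),
        ("coln", "array", "coln", "array"),      # array column to explode
        ("itct1", "struct", "itct1", "struct"),  # nested JSON to keep intact
    ])

# Step 2: explode the array so each element becomes its own row.
df1 = datasource0.toDF().select(
    col("id"), col("col1"), col("itct1"),
    explode(col("coln")).alias("coln"))

# Step 3: set aside the columns whose nested JSON should stay intact.
intact_json = df1.select("id", "itct1")

# Step 4: back to a DynamicFrame, drop the intact columns, relationalize.
dyf1 = DynamicFrame.fromDF(df1, glueContext, "dyf1").drop_fields(["itct1"])
relationalized = Relationalize.apply(
    frame=dyf1,
    staging_path="s3://my-bucket/glue-temp/",  # assumption: temp S3 path
    name="root")
root = relationalized.select("root").toDF()

# Step 5: join the relationalized rows back to the intact columns on id.
result = root.join(intact_json, on="id", how="left")
result.show()
```

Note that Relationalize returns a DynamicFrameCollection: besides "root", it emits one child table per remaining nested array (named like "root_coln"), which you can select and join the same way if you need them.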