AWS Glue: How to handle nested JSON with varying schemas

独厮守ぢ 2021-01-31 11:29

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse v

5 Answers
  •  遥遥无期
    2021-01-31 11:45

    The procedure I found useful for flattening nested JSON:

    1. Use ApplyMapping on the first level to get datasource0;

    2. Explode the array objects to get rid of the element level (explode accepts array or map columns): df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode;

    3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm);

    4. Transform df1 back to a DynamicFrame, Relationalize that DynamicFrame, and drop the intact columns with dataframe.drop_fields(itct1, itct2, ..., itctm);

    5. Join the relationalized table with the intact table on the 'id' column.
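
To make the data flow of steps 2 to 5 concrete, here is a minimal plain-Python sketch that mimics it on a list of dicts. It does not use the Glue or PySpark APIs: explode, flatten, and join_on are hypothetical helpers, and the sample records and column names (items, payload, meta) are made up for illustration. Two assumptions worth flagging: flatten only unnests structs (Glue's Relationalize additionally pivots arrays into separate tables), and the intact rows are de-duplicated by id, since selecting them from the exploded frame would otherwise repeat rows in the join.

```python
def explode(rows, column):
    """Step 2: emit one row per element of the list stored in `column`.
    Rows whose list is empty or missing are dropped."""
    out = []
    for row in rows:
        for element in row.get(column) or []:
            out.append({**row, column: element})
    return out


def flatten(record, prefix=""):
    """Step 4 (simplified): turn nested dicts into dotted top-level keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def join_on(left, right, key):
    """Step 5: inner join of two row lists on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data standing in for datasource0 after step 1.
records = [
    {"id": 1, "meta": {"src": "a"}, "payload": {"raw": [1, 2]},
     "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
    {"id": 2, "meta": {"src": "b"}, "payload": {"raw": []},
     "items": [{"sku": "z", "qty": 5}]},
]
intact_cols = ["payload"]  # columns to keep as untouched JSON

# Step 2: one row per element of "items".
df1 = explode(records, "items")

# Step 3: keep id plus the intact columns, one row per id (assumption).
intact = list(
    {r["id"]: {"id": r["id"], **{c: r[c] for c in intact_cols}} for r in df1}.values()
)

# Step 4: drop the intact columns, then flatten what remains.
flat = [flatten({k: v for k, v in r.items() if k not in intact_cols}) for r in df1]

# Step 5: join the flattened rows back to the intact columns on "id".
result = join_on(flat, intact, "id")
for row in result:
    print(row)
```

Each output row carries the flattened fields (meta.src, items.sku, items.qty) next to the untouched payload object for the same id.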
