AWS Glue: How to handle nested JSON with varying schemas


独厮守ぢ 2021-01-31 11:29

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse v

5 answers

遥遥无期 (original poster)

2021-01-31 11:45
The procedure I found useful for flattening nested JSON:
1. Use ApplyMapping on the first level to get datasource0.
2. Explode the nested array columns (for example, arrays of structs) to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)). This requires from pyspark.sql.functions import explode.
3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm).
4. Convert df1 back to a DynamicFrame, Relationalize it, and drop the intact columns with the DynamicFrame's drop_fields(itct1, itct2, ..., itctm).
5. Join the relationalized table with the intact table on the 'id' column.
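
To make the row shapes at each step concrete, here is a minimal plain-Python sketch of the same data flow. It is a stand-in, not Glue or Spark code: the sample records, the column names (items, raw), and the helpers explode and flatten_structs are all hypothetical, and flatten_structs is only a rough analogue of Relationalize, which in Glue also pivots arrays out into separate tables.

```python
# Plain-Python stand-in for steps 2-5; no Spark or Glue calls are used.

# Hypothetical sample records: "items" is a nested array to explode,
# "raw" is a JSON object that should be kept intact.
records = [
    {"id": 1, "name": "a",
     "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}],
     "raw": {"k": [1, 2]}},
    {"id": 2, "name": "b",
     "items": [{"sku": "z", "qty": 5}],
     "raw": {"k": []}},
]

INTACT = ["raw"]  # columns to keep as-is (itct1, ..., itctm in the steps)


def explode(rows, column):
    """Emit one output row per element of row[column] (mirrors step 2)."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)
            new_row[column] = element
            out.append(new_row)
    return out


def flatten_structs(row, prefix=""):
    """Flatten nested dicts into dotted column names (rough analogue of step 4)."""
    flat = {}
    for key, value in row.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten_structs(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Step 2: explode the array column.
exploded = explode(records, "items")

# Step 3: set the intact columns aside, keyed by id.
intact_by_id = {r["id"]: {c: r[c] for c in INTACT} for r in exploded}

# Step 4: drop the intact columns, then flatten what is left.
flat = [
    flatten_structs({k: v for k, v in r.items() if k not in INTACT})
    for r in exploded
]

# Step 5: join the flattened rows back to the intact columns on id.
joined = [{**r, **intact_by_id[r["id"]]} for r in flat]

for row in joined:
    print(row)
```

Note that id is no longer unique after the explode step. The sketch sidesteps this by keying the intact table on id, which deduplicates it; in Spark, joining two DataFrames on a non-unique id multiplies the matching rows, so the intact table may need to be deduplicated before the join.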