AWS Glue: How to handle nested JSON with varying schemas

后端未结

关注

 5  1735

Objective: We\'re hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse v

相关标签:

5条回答

佛祖请我去吃肉

2021-01-31 11:36

As of 12/20/2018, I was able to manually define a table with first level json fields as columns with type STRING. Then in the glue script the dynamicframe has the column as a string. From there, you can do an Unbox operation of type json on the fields. This will json parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous json schemas from the same input if you can loop through a list of schemas.

However, one word of caution, this is incredibly slow. I think that glue is downloading the source files from s3 during each iteration of the loop. I've been trying to find a way to persist the initial source data but it looks like .toDF derives the schema of the string json fields even if you specify them as glue StringType. I'll add a comment here if I can figure out a solution with better performance.

0 讨论(0)
发布评论:

提交评论
- 加载中...
悲&欢浪女

2021-01-31 11:40

you should add a glue classifier preferably $[*]

When you crawl the json file in s3, it will read the first line of the file.

You can create a glue job in order to load the data catalog table of this json file into the redshift.

My only problem with here is that Redshift Spectrum has problems reading json tables in the data catalog..

let me know if you have found a solution

0 讨论(0)
发布评论:

提交评论
- 加载中...
遥遥无期

2021-01-31 11:45
The procedure I found useful to shallow nested json:
1. ApplyMapping for the first level as datasource0;
2. Explode struct or array objects to get rid of element level df1 = datasource0.toDF().select(id,col1,col2,...,explode(coln).alias(coln), where explode requires from pyspark.sql.functions import explode;
3. Select the JSON objects that you would like to keep intact by intact_json = df1.select(id, itct1, itct2,..., itctm);
4. Transform df1 back to dynamicFrame and Relationalize the dynamicFrame as well as drop the intact columns by dataframe.drop_fields(itct1, itct2,..., itctm);
5. Join relationalized table with the intact table based on 'id' column.
0 讨论(0)
发布评论:

提交评论
- 加载中...
时光取名叫无心

2021-01-31 11:53
I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]
```
import json

# Your mapping function
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply mapping function f to all DynamicRecords in DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)
```
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.
0 讨论(0)
发布评论:

提交评论
- 加载中...
无人及你

2021-01-31 11:53

This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but might suit your needs. You can define a JSON path for a field or something like that.

Other than that - Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run against various roadblocks for the last three weeks and decided to completely forgo any and all Glue functionality and only Spark, that way it's both portable and actually works.

One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".

0 讨论(0)
发布评论:

提交评论
- 加载中...