AWS Glue Crawler Creates Partition and File Tables


I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

2 Answers

花落未央

2021-01-18 02:37

Glue Crawler leaves a lot to be desired. It promises to handle a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which may be where your data comes from.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.
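As a rough sketch of what that unpartitioned table could look like, the helper below just builds the DDL string you would run in Athena. It assumes the files are JSON and uses the OpenX JSON SerDe; the table name, bucket, and columns are hypothetical placeholders, so swap in your own.

```python
def unpartitioned_table_ddl(table, bucket, prefix, columns):
    """Build Athena DDL for an unpartitioned JSON table rooted at s3://bucket/prefix/."""
    # columns is a list of (name, athena_type) pairs; these are placeholders.
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    clean_prefix = prefix.strip("/")
    location = f"s3://{bucket}/{clean_prefix}/" if clean_prefix else f"s3://{bucket}/"
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_sql}\n)\n"
        # Assumption: the data is JSON, one object per line.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, bucket, and columns, for illustration only.
ddl = unpartitioned_table_ddl("events", "my-bucket", "", [("id", "string"), ("ts", "string")])
print(ddl)
```

Because the table location is the root of the year/month/day/hour tree, Athena scans every file underneath it on each query, which is fine while the data set is small.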

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key, figure out which partition it belongs to, and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API (an AlreadyExistsException), but as long as your function catches it and ignores it you will be fine.
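A minimal sketch of such a function, using boto3, might look like the following. It assumes keys of the form <prefix>/YYYY/MM/DD/HH/<file> and a Glue table that already exists with partition keys year, month, day, and hour; DATABASE, TABLE, and the helper partition_from_key are hypothetical names introduced here for illustration.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical names: replace with your own Glue database and table.
DATABASE = "my_database"
TABLE = "my_table"

# Assumption: keys look like <optional prefix>/YYYY/MM/DD/HH/<file>.
KEY_PATTERN = re.compile(
    r"^(?P<prefix>(?:.*/)?)(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/[^/]+$"
)


def partition_from_key(key):
    """Return (partition_values, partition_prefix) for a key, or None if it does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    values = [match.group(name) for name in ("year", "month", "day", "hour")]
    return values, match.group("prefix") + "/".join(values) + "/"


def handler(event, context):
    # Imported here so the parsing helper above can be exercised without AWS access.
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        parsed = partition_from_key(key)
        if parsed is None:
            continue  # not a data file in the expected layout
        values, prefix = parsed
        # Reuse the table's storage descriptor, changing only the location.
        descriptor = dict(table["StorageDescriptor"], Location=f"s3://{bucket}/{prefix}")
        try:
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={"Values": values, "StorageDescriptor": descriptor},
            )
        except glue.exceptions.AlreadyExistsException:
            pass  # the partition was already added by an earlier object
```

Most objects will land in a partition that already exists, so the AlreadyExistsException branch is the common case rather than a real failure.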

南旧

2021-01-18 02:41

Most of the time, files with just one record end up as separate tables. I tried files with more than two records and was able to group everything under one table with the respective partitions.

What do your JSON files look like?
