How to ignore amazon athena struct order

后端未结

关注

 1  1991

I\'m getting an HIVE_PARTITION_SCHEMA_MISMATCH error that I\'m not quite sure what to do about. When I look at the 2 different schemas, the only thing that\'s

相关标签:

1条回答

清歌不尽

2020-12-06 16:24

I suggest you stop using Glue crawlers. It's probably not the response you had hoped for, but crawlers are really bad at their job. They can be useful sometimes as a way to get a schema from a random heap of data that someone else produced and that you don't want to spend time looking at to figure out what its schema is – but once you have a schema, and you know that new data will follow that schema, Glue crawlers are just in the way, and produce unnecessary problems like the one you have encountered.

What to do instead depends on how new data is added to S3.

If you are in control of the code that produces the data, you can add code that adds partitions after the data has been uploaded. The benefit of this solution is that partitions are added immediately after new data has been produced so tables are always up to date. However, it might tightly couple the data producing code with Glue (or Athena if you prefer to add partitions through SQL) in a way that is not desirable.

If it doesn't make sense to add the partitions from the code that produces the data, you can create a Lambda function that does it. You can either set it to run at a fixed time every day (if you know the location of the new data you don't have to wait until it exists, partitions can point to empty locations), or you can trigger it by S3 notifications (if there are multiple files you can either figure out a way to debounce the notifications through SQS or just create the partition over and over again, just swallow the error if the partition already exists).

You may also have heard of MSCK REPAIR TABLE …. It's better than Glue crawlers in some ways, but just as bad in other ways. It will only add new partitions, never change the schema, which is usually what you want, but it's extremely inefficient, and runs slower and slower the more files there are. Kind of like Glue crawlers.

0 讨论(0)
发布评论:

提交评论
- 加载中...