Question
Take, for example, an S3 bucket containing files of the form francescototti_yyyy_mm_dd_hh.csv.gz, such as:
francescototti_2019_05_01_00.csv.gz,
francescototti_2019_05_01_01.csv.gz,
francescototti_2019_05_01_02.csv.gz,
.....
francescototti_2019_05_01_23.csv.gz,
francescototti_2019_05_02_00.csv.gz
Each hourly file is about 30 MB. I would like the final Hive table to be partitioned by day and stored as ORC files.
What is the best way to do this? I can imagine a few approaches, potentially one of the following:
An automated script that takes the day's hourly files and moves them into the corresponding day folder in the S3 bucket, then a partitioned external table created over this newly structured S3 bucket.
An external Hive table on top of the raw S3 location, plus an additional partitioned Hive table that gets inserted into from the raw table.
What are the pros/cons of each? Any other recommendations?
Answer 1:
The first option (an automated script that moves the day's hourly files into the corresponding day folder in the S3 bucket, with a partitioned external table created over this newly structured bucket) looks better than building a table on top of the raw S3 location, because the raw location contains too many files. Queries will be slow since Hive will list all of them, even if you filter by the INPUT__FILE__NAME virtual column, and if fresh files keep arriving there it will only get worse.
If there were not too many files in the raw folder, say hundreds, and it were not growing, then I would choose option 2.
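A minimal sketch of what that option 2 insert could look like, assuming the Hive CLI is available and using hypothetical table names (raw_totti, totti_orc), columns (col1, col2), and S3 paths, since the actual schema is not given in the question:

```python
# Sketch of option 2: an external table over the raw files plus a partitioned
# ORC table filled from it via dynamic partitioning. All names/paths are assumed.
import subprocess

HQL = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS raw_totti (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/raw/';

CREATE EXTERNAL TABLE IF NOT EXISTS totti_orc (col1 STRING, col2 STRING)
PARTITIONED BY (day STRING)
STORED AS ORC
LOCATION 's3://my-bucket/orc/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- derive the day partition from the source file name; this relies on the
-- INPUT__FILE__NAME virtual column mentioned above, so Hive still has to
-- list every file in the raw folder
INSERT OVERWRITE TABLE totti_orc PARTITION (day)
SELECT col1, col2,
       regexp_extract(INPUT__FILE__NAME,
                      'francescototti_(\\d{4}_\\d{2}_\\d{2})_\\d{2}\\.csv\\.gz', 1) AS day
FROM raw_totti;
"""

subprocess.run(["hive", "-e", HQL], check=True)
```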
A possible drawback of option one is S3's eventual consistency after removing files and then repeatedly reading/listing the folder. After removing a large number of files (say thousands at a time) you will almost certainly hit eventual consistency issues (phantom files) during the next hour or so. If you do not remove too many files at a time, and it seems you will not since you move only 24 files per day, then with very high probability you will not hit the eventual consistency problem in S3. Another drawback is that moving files costs money (each copy and delete is a request), but it is still better than reading/listing too many files in the same folder.
So, option one looks better.
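For the first option, one possible shape of the mover script is the following boto3 sketch; the bucket name and the raw/daily prefixes are assumptions, and the day folders use Hive-style day=YYYY-MM-DD names so the partitioned external table can pick them up (e.g. via MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION):

```python
# Sketch of the "move hourly files into day folders" script. Bucket name and
# prefixes are assumptions; S3 has no rename, so each move is a copy + delete.
import re
import boto3

BUCKET = "my-bucket"      # assumed bucket name
RAW_PREFIX = "raw/"       # assumed flat prefix holding the hourly files
DAY_PREFIX = "daily/"     # assumed target prefix, one day= folder per day

PATTERN = re.compile(r"francescototti_(\d{4})_(\d{2})_(\d{2})_\d{2}\.csv\.gz$")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=RAW_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        match = PATTERN.search(key)
        if not match:
            continue
        yyyy, mm, dd = match.groups()
        filename = key.rsplit("/", 1)[-1]
        new_key = f"{DAY_PREFIX}day={yyyy}-{mm}-{dd}/{filename}"
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)
```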
Other recommendations: rewrite the upstream process so that it writes files into daily folders. This is the best option. In that case you can build a table on top of the top-level S3 location and simply add a daily partition each day. Partition pruning will work fine, you do not need to move files, and there is no inconsistency issue in S3.
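If the upstream process writes daily folders like that (or the mover script above creates them), the recurring step can be just one ALTER TABLE per day. A sketch, again with assumed table name, columns, and S3 paths, run through the Hive CLI:

```python
# Sketch of the one-time table DDL plus the daily "add partition" step.
# Table name, columns and S3 location are assumptions.
import subprocess
from datetime import date, timedelta

day = (date.today() - timedelta(days=1)).isoformat()  # e.g. register yesterday

HQL = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS totti_daily (col1 STRING, col2 STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/daily/';

-- only the new partition is added, so queries prune to the folders they need
ALTER TABLE totti_daily ADD IF NOT EXISTS PARTITION (day='{day}')
LOCATION 's3://my-bucket/daily/day={day}/';
"""

subprocess.run(["hive", "-e", HQL], check=True)
```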
Answer 2:
You can configure Amazon S3 Events to automatically trigger an AWS Lambda function when an object is created in an Amazon S3 bucket.
This Lambda function could read the filename (Key) and move the object into another directory (actually, it would copy + delete the object).
This way, the objects are moved to the desired location as soon as they are created. No batch jobs needed.
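A hedged sketch of what such a Lambda handler could look like in Python, assuming the day=YYYY-MM-DD target layout from the question's first option; the destination prefix is an assumption:

```python
# Sketch of a Lambda handler triggered by S3 "object created" events.
# It parses the hourly file name, copies the object into a day folder,
# then deletes the original (S3 has no real move). Prefix name is assumed.
import re
import urllib.parse
import boto3

s3 = boto3.client("s3")
PATTERN = re.compile(r"francescototti_(\d{4})_(\d{2})_(\d{2})_\d{2}\.csv\.gz$")
DAY_PREFIX = "daily/"  # assumed destination prefix

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        match = PATTERN.search(key)
        if not match:
            continue  # ignore objects that do not match the hourly naming scheme
        yyyy, mm, dd = match.groups()
        filename = key.rsplit("/", 1)[-1]
        new_key = f"{DAY_PREFIX}day={yyyy}-{mm}-{dd}/{filename}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
```

Note that the S3 event filter (or the destination prefix) should be set up so the function is not re-triggered by the copies it creates itself.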
However, this would not change the format of the files. The content could be converted by using Amazon Athena to Convert to Columnar Formats. That's a bit trickier, since you'd need to specify the source and destination.
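One way to do that conversion is an Athena CTAS (CREATE TABLE AS SELECT) query that rewrites the data as day-partitioned ORC; a sketch submitted through boto3, where the database, table names, S3 locations, and the "$path"-based day extraction are assumptions:

```python
# Sketch: submit an Athena CTAS query that rewrites the CSV data as
# day-partitioned ORC. Database, tables, and S3 locations are assumptions.
import boto3

athena = boto3.client("athena")

CTAS = r"""
CREATE TABLE totti_orc
WITH (
  format = 'ORC',
  external_location = 's3://my-bucket/orc/',
  partitioned_by = ARRAY['day']
) AS
SELECT col1, col2,
       regexp_extract("$path",
                      'francescototti_(\d{4}_\d{2}_\d{2})_\d{2}\.csv\.gz', 1) AS day
FROM raw_totti
"""

athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```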
Source: https://stackoverflow.com/questions/56190406/how-to-move-amazon-s3-objects-into-partitioned-directories