Question
I have about 30 tables in my RDS Postgres/Oracle instance (haven't decided between Oracle and Postgres yet). I want to fetch all the records that have been inserted or updated in the last 4 hours (configurable), create a CSV file for each table, and store the files in S3. I want this whole process to be transactional: if there is any error in fetching data from one table, I don't want data for the other 29 tables to be persisted in S3. The data isn't very large; it should be on the order of a few hundred records or less per table for a 4-hour window.
I am thinking of having a Spark job on an EMR cluster fetch the data from RDS, create a CSV for each table, and post all the files to S3 at the end of the process. The EMR cluster will be destroyed once the data is posted to S3. A CloudWatch rule will invoke a Lambda every 4 hours, which will spin up a new EMR cluster that performs this job, roughly as sketched below.
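Roughly, the Lambda handler would look something like this; the bucket name, script path, EMR release label, and instance sizes below are just placeholders for illustration:

```python
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # Spin up a transient EMR cluster that runs the Spark export job and
    # terminates itself once the step finishes (or fails).
    emr.run_job_flow(
        Name="rds-to-s3-export",
        ReleaseLabel="emr-5.30.0",               # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
        },
        Steps=[
            {
                "Name": "export-rds-tables",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # placeholder script location and window argument
                    "Args": ["spark-submit", "s3://my-bucket/jobs/export_job.py", "--hours", "4"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
```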
Are there any alternate approaches worth exploring for this transformation?
Answer 1:
Take a look at AWS Glue, which uses EMR under the hood but doesn't require you to manage infrastructure or configuration: just set up a crawler and write your ETL job.
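As a rough illustration, a minimal Glue ETL job (PySpark) could look like the sketch below, assuming the crawler has already catalogued your RDS tables under a Data Catalog database; the database, table, and bucket names are placeholders:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read one of the catalogued RDS tables through the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="orders"
)

# Write it out as CSV under a per-table prefix in S3.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/orders/"},
    format="csv",
)

job.commit()
```

You would repeat (or loop) the read/write pair for each of the 30 tables within the same job.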
Please note that AWS Glue doesn't support predicate pushdown for JDBC connections (currently S3 only), which means it will load the entire table first and only then apply the filtering.
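So any time-window filtering happens in Spark after the full load. Continuing from the glue_context and catalog names in the sketch above, and assuming each table carries an updated_at timestamp column (adjust to however your schema tracks changes), it could look something like this:

```python
from datetime import datetime, timedelta
from pyspark.sql import functions as F

# Glue pulls the entire table over JDBC; the 4-hour window is applied in Spark afterwards.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="orders"
)
cutoff = (datetime.utcnow() - timedelta(hours=4)).strftime("%Y-%m-%d %H:%M:%S")
recent = dyf.toDF().filter(F.col("updated_at") >= F.lit(cutoff))
```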
You should also think carefully about atomicity, since a Glue ETL job simply processes data and writes to a sink without transactions. In case of failure it won't remove partially written records, so you have to manage that yourself. There are a few options I would consider:
- Write the data into a temporary folder (local or S3) for each execution, and then move the objects to the final destination with the aws s3 sync command or copy them using TransferManager from the AWS SDK (a boto3 version of this idea is sketched after the list)
- Write the data into a dedicated folder under the final destination, and in case of failure delete that folder using the CLI or SDK
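A rough sketch of the first option using boto3 instead of the Java TransferManager (the bucket and prefix names are placeholders): write everything under a per-run temporary prefix, and only copy the objects to the final prefix once every table has been exported successfully.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder bucket name

def promote_run(run_id: str) -> None:
    """Copy all objects from the run's temp prefix to the final prefix.

    Called only after every table was exported without errors, so a failed
    run leaves nothing under the final prefix.
    """
    temp_prefix = f"exports/tmp/{run_id}/"
    final_prefix = "exports/latest/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=temp_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=final_prefix + key[len(temp_prefix):],
            )
            s3.delete_object(Bucket=BUCKET, Key=key)  # clean up the temp copy
```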
Source: https://stackoverflow.com/questions/50361589/rds-to-s3-data-transformation-aws