amazon-data-pipeline

Automatic AWS DynamoDB to S3 export failing with “role/DataPipelineDefaultRole is invalid”

一曲冷凌霜 submitted on 2019-12-05 02:12:26
Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed, but it fails to run. It seems to have trouble identifying/running an EC2 resource to do the export. When I access EMR through the AWS Console, I see entries like this: Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04, Terminated with errors: EMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is invalid. Why am I getting this message? Do I need to set up/configure something else for the pipeline to…
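
A frequent cause of this "role is invalid" error is that DataPipelineDefaultRole's trust relationship does not allow the EMR service to assume it. A minimal sketch of the trust policy the role typically needs, assuming the default role names from the tutorial:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "datapipeline.amazonaws.com",
              "elasticmapreduce.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

If the trust policy checks out, also verify that the role has the AWSDataPipelineRole managed policy attached; both are preconditions that the console's one-click role creation normally handles.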

Run a python script via AWS Data Pipelines

半腔热情 submitted on 2019-12-02 03:41:34
I use AWS Data Pipelines to run nightly SQL queries that populate tables for summary statistics. The UI's a bit funky, but eventually I got it up and working. Now I'd like to do something similar with a Python script. I have a file that I run every morning on my laptop (forecast_rev.py), but of course that means I have to turn on my laptop and kick this off every day. Surely I can schedule a pipeline to do the same thing, and thus go away on vacation and not care. For the life of me, I can't find a tutorial, AWS doc, or StackOverflow post about this! I'm not even sure how to get started. Does…
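
One way to do this, sketched below on the assumption that the script lives in S3 and that an Ec2Resource named Ec2Instance is defined elsewhere in the pipeline, is a ShellCommandActivity that pulls the file down and runs it (bucket and paths here are placeholders):

    {
      "id": "RunForecastRev",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "Ec2Instance" },
      "schedule": { "ref": "DefaultSchedule" },
      "command": "aws s3 cp s3://my-bucket/scripts/forecast_rev.py /tmp/forecast_rev.py && python /tmp/forecast_rev.py"
    }

ShellCommandActivity also accepts a scriptUri pointing at a shell script in S3, which can in turn install dependencies before invoking Python.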

Need strategy advice for migrating large tables from RDS to DynamoDB

让人想犯罪 __ submitted on 2019-12-01 06:44:41
Question: We have a couple of MySQL tables in RDS that are huge (over 700 GB) that we'd like to migrate to a DynamoDB table. Can you suggest a strategy, or a direction, to do this in a clean, parallelized way? Perhaps using EMR or AWS Data Pipeline. Answer 1: You can use AWS Data Pipeline. There are two basic templates, one for moving RDS tables to S3 and a second for importing data from S3 to DynamoDB. You can create your own pipeline using both templates. Regards. Answer 2: One thing to consider with such…
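
To make Answer 1 concrete, here is a rough skeleton (object names hypothetical, details trimmed) of how the two template stages could be chained in a single definition so that the DynamoDB import only starts after the RDS dump finishes:

    {
      "objects": [
        {
          "id": "RdsToS3Copy",
          "type": "CopyActivity",
          "input": { "ref": "SourceRdsTable" },
          "output": { "ref": "S3StagingDir" }
        },
        {
          "id": "S3ToDynamoImport",
          "type": "EmrActivity",
          "dependsOn": { "ref": "RdsToS3Copy" },
          "runsOn": { "ref": "ImportEmrCluster" },
          "step": "…as generated by the stock import-DynamoDB-data-from-S3 template…"
        }
      ]
    }

For a 700 GB source, parallelism would come from splitting the copy across several activities (for example, one per key range) and from raising the DynamoDB table's write throughput during the EMR import.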

Amazon Data Pipeline: How to use a script argument in a SqlActivity?

谁都会走 submitted on 2019-11-30 13:22:28
Question: When trying to use a script argument in the SqlActivity:

    {
      "id" : "ActivityId_3zboU",
      "schedule" : { "ref" : "DefaultSchedule" },
      "scriptUri" : "s3://location_of_script/unload.sql",
      "name" : "unload",
      "runsOn" : { "ref" : "Ec2Instance" },
      "scriptArgument" : [
        "'s3://location_of_unload/#format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'",
        "'aws_access_key_id=????;aws_secret_access_key=*******'"
      ],
      "type" : "SqlActivity",
      "dependsOn" : { "ref" : "ActivityId_YY69k" },
      "database" : { …
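
For what it's worth, the first scriptArgument in the excerpt has a stray closing brace and is missing the opening brace of the pipeline expression syntax, which is #{...}; assuming the goal is yesterday's date path, it would presumably read:

    "scriptArgument" : [
      "'s3://location_of_unload/#{format(minusDays(@scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'",
      "'aws_access_key_id=????;aws_secret_access_key=*******'"
    ]

format and minusDays are built-in Data Pipeline expression functions; the credential argument is left elided as in the original.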

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

无人久伴 submitted on 2019-11-29 11:08:32
I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster rather than amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Below is my data pipeline definition for EMR 3.x. It works well, so I hope others find this useful (including the answer for EMR 4.x/5.x), as the…
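
The TezTask failure is consistent with Hive defaulting to the Tez engine on newer EMR releases. One workaround that has been reported, sketched here as an assumption rather than a confirmed fix, is to pin Hive back to MapReduce through an EmrConfiguration object, which the releaseLabel-style EmrCluster supports:

    {
      "id": "EmrClusterForBackup",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.1.0",
      "configuration": { "ref": "HiveSiteConfiguration" }
    },
    {
      "id": "HiveSiteConfiguration",
      "type": "EmrConfiguration",
      "classification": "hive-site",
      "property": [ { "ref": "HiveExecEngine" } ]
    },
    {
      "id": "HiveExecEngine",
      "type": "Property",
      "key": "hive.execution.engine",
      "value": "mr"
    }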

How to pipe data from AWS Postgres RDS to S3 (then Redshift)?

假如想象 submitted on 2019-11-29 03:11:50
Question: I'm using the AWS Data Pipeline service to pipe data from an RDS MySQL database to S3 and then on to Redshift, which works nicely. However, I also have data living in an RDS Postgres instance which I would like to pipe the same way, but I'm having a hard time setting up the JDBC connection. If this is unsupported, is there a work-around? "connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB" Answer 1: This doesn't work yet. AWS hasn't built/released the functionality to connect nicely to…
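
Once generic JDBC support is available, a Postgres source can be declared with a JdbcDatabase object instead of RdsDatabase, supplying the driver class explicitly. A minimal sketch, with credentials and hostnames as placeholders:

    {
      "id": "PostgresRdsDatabase",
      "type": "JdbcDatabase",
      "connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB",
      "jdbcDriverClass": "org.postgresql.Driver",
      "username": "db_user",
      "*password": "db_password"
    }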

Exporting an AWS Postgres RDS Table to AWS S3

对着背影说爱祢 submitted on 2019-11-28 13:44:19
I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS to AWS S3. Does anybody know how this is done? More precisely, I wanted to export a Postgres table to AWS S3 using Data Pipeline. The reason I am using Data Pipeline is that I want to automate this process, and this export is going to run once every week. Any other suggestions will also work. There is a sample on GitHub: https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3. Here is the code: https://github.com/awslabs/data-pipeline-samples/blob/master/samples/RDStoS3/RDStoS3Pipeline.json. I built a Pipeline from…
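
Short of the full sample, a lighter-weight approach is a ShellCommandActivity that streams a psql \copy straight into S3. This sketch assumes psql and the AWS CLI are available on the pipeline's EC2 resource and that credentials come from a .pgpass file; table, host, and bucket names are placeholders:

    {
      "id": "ExportPostgresTable",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "Ec2Instance" },
      "schedule": { "ref": "DefaultSchedule" },
      "command": "psql \"host=THE_RDS_INSTANCE dbname=THE_DB user=THE_USER\" -c \"\\copy my_table TO STDOUT WITH CSV HEADER\" | aws s3 cp - s3://my-bucket/exports/my_table.csv"
    }

Piping to "aws s3 cp - s3://…" streams the dump without staging it on local disk, which matters when the weekly export is larger than the instance's storage.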