amazon-data-pipeline

Exporting an AWS Postgres RDS Table to AWS S3

余生长醉 submitted on 2019-12-29 01:34:34
Question: I wanted to use AWS Data Pipeline to pipe data from a Postgres RDS instance to AWS S3. Does anybody know how this is done? More precisely, I wanted to export a Postgres table to AWS S3 using Data Pipeline. The reason I am using Data Pipeline is that I want to automate this process, and the export is going to run once every week. Any other suggestions will also work. Answer 1: There is a sample on GitHub: https://github.com/awslabs/data-pipeline-samples/tree/master/samples/RDStoS3 Here is the code: https:/
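Since the question explicitly welcomes other suggestions, below is a minimal sketch of a weekly export done outside Data Pipeline: stream the table out of Postgres with psycopg2's COPY support and upload the result with boto3. The connection string, table, bucket, and key names are placeholders, and the sketch assumes the table fits comfortably in memory.

    import io

    import boto3
    import psycopg2

    # Placeholders: swap in your own RDS endpoint, credentials, table, bucket, and key.
    RDS_DSN = "host=mydb.xxxx.us-east-1.rds.amazonaws.com dbname=mydb user=export_user password=..."
    TABLE = "public.orders"
    BUCKET = "my-export-bucket"
    KEY = "exports/orders.csv"

    def export_table_to_s3():
        buf = io.StringIO()
        with psycopg2.connect(RDS_DSN) as conn, conn.cursor() as cur:
            # COPY ... TO STDOUT streams the table straight out of Postgres as CSV.
            cur.copy_expert("COPY (SELECT * FROM {}) TO STDOUT WITH CSV HEADER".format(TABLE), buf)
        boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue().encode("utf-8"))

    if __name__ == "__main__":
        export_table_to_s3()

Scheduling it once a week is then a matter of cron, a scheduled Lambda, or the Data Pipeline schedule from the linked sample.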

AWS Datapipeline - issue with accented characters

青春壹個敷衍的年華 submitted on 2019-12-24 11:37:52
Question: I am new to AWS Data Pipeline. I created a data pipeline that successfully pulls all the content from RDS into an S3 bucket. Everything works, and I see my .csv file in the S3 bucket. But I am storing Spanish names in my table, and in the CSV I see "Garc�a" instead of "García". Answer 1: Looks like the wrong codepage is used. Just reference the correct codepage and you should be fine. The following topic might help: Text files uploaded to S3 are encoded strangely? Answer 2: AWS Data Pipeline is implemented in Java, and uses JDBC
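Before changing the pipeline, it can help to confirm which codepage the export was actually written in. A small sketch, assuming the CSV has been downloaded locally (the file name is a placeholder):

    # Try a few likely encodings; whichever one turns the raw bytes into "García"
    # cleanly is the codepage the JDBC export actually produced.
    with open("export.csv", "rb") as f:
        raw = f.read()

    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            print(encoding, "->", raw.decode(encoding)[:120])
        except UnicodeDecodeError as exc:
            print(encoding, "failed:", exc)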

Put json data pipeline definition using Boto3

£可爱£侵袭症+ submitted on 2019-12-24 08:02:15
Question: I have a data pipeline definition in JSON format, and I would like to 'put' it using Boto3 in Python. I know you can do this via the AWS CLI using put-pipeline-definition, but Boto3 (and the AWS API) use a different format, splitting the definition into pipelineObjects , parameterObjects and parameterValues . Do I need to write code to translate from the JSON definition to the format expected by the API/Boto? If so, is there a library that does this? Answer 1: The AWS CLI has code that does this
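The translation the CLI performs is small enough to approximate by hand. A sketch, assuming the CLI-style layout ({"objects": [...], ...}) in which every key other than id and name becomes a field with either a stringValue or a refValue; the file name and pipeline id are placeholders, and parameterObjects/parameterValues would need analogous handling, which is omitted here.

    import json

    import boto3

    def to_pipeline_objects(definition):
        """Convert CLI-style 'objects' into the pipelineObjects shape boto3 expects."""
        pipeline_objects = []
        for obj in definition.get("objects", []):
            fields = []
            for key, value in obj.items():
                if key in ("id", "name"):
                    continue
                values = value if isinstance(value, list) else [value]
                for v in values:
                    if isinstance(v, dict) and "ref" in v:
                        fields.append({"key": key, "refValue": v["ref"]})
                    else:
                        fields.append({"key": key, "stringValue": str(v)})
            pipeline_objects.append(
                {"id": obj["id"], "name": obj.get("name", obj["id"]), "fields": fields}
            )
        return pipeline_objects

    with open("pipeline_definition.json") as f:   # placeholder file name
        definition = json.load(f)

    client = boto3.client("datapipeline")
    client.put_pipeline_definition(
        pipelineId="df-EXAMPLE",                  # placeholder pipeline id
        pipelineObjects=to_pipeline_objects(definition),
    )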

Load props file in EMR Spark Application

点点圈 submitted on 2019-12-24 02:13:44
Question: I am trying to load custom properties in my Spark application using: command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,s3://spark-config-test/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar However, I am getting the following exception: Exception in thread "main" java.lang.IllegalArgumentException: Invalid properties file 's3://spark-config-test/myprops.conf''. at org.apache.spark
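spark-submit expects --properties-file to be a path on the local filesystem of the machine it runs on, so one workaround is to copy the file down from S3 in a step of its own and point --properties-file at the local copy. The sketch below expresses that idea with boto3's add_job_flow_steps rather than Data Pipeline's EmrActivity step syntax; the cluster id and application jar location are placeholders.

    import boto3

    emr = boto3.client("emr")

    steps = [
        {   # Step 1: pull the properties file onto the cluster's local disk.
            "Name": "fetch-spark-props",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["aws", "s3", "cp",
                         "s3://spark-config-test/myprops.conf", "/home/hadoop/myprops.conf"],
            },
        },
        {   # Step 2: reference the local copy instead of the s3:// URI.
            "Name": "run-spark-app",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "--properties-file", "/home/hadoop/myprops.conf",
                         "--num-executors", "5", "--executor-cores", "2",
                         "--class", "com.amazon.Main",
                         "s3://my-app-bucket/SWALiveOrderModelSpark-1.0-super.jar"],
            },
        },
    ]

    emr.add_job_flow_steps(JobFlowId="j-EXAMPLE", Steps=steps)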

Truncate DynamoDb or rewrite data via Data Pipeline

老子叫甜甜 submitted on 2019-12-23 20:42:29
Question: It is possible to dump DynamoDB via Data Pipeline and also to import data into DynamoDB. The import works, but the imported data is always appended to the data that already exists in DynamoDB. So far I have found working examples that scan DynamoDB and delete items one by one or in batches, but for a large amount of data that is not a good option. It is also possible to delete the table entirely and recreate it, but then the indexes would be lost. So the best way would be to overwrite the DynamoDB data via import by
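If dropping and recreating the table is acceptable, the indexes do not have to be lost: read the schema with describe_table first and feed it back into create_table. A sketch, assuming a provisioned-capacity table; on-demand billing, streams, TTL, and local secondary indexes would need extra handling.

    import boto3

    dynamodb = boto3.client("dynamodb")

    def recreate_table(table_name):
        """Drop and recreate a table, preserving key schema and global secondary indexes."""
        desc = dynamodb.describe_table(TableName=table_name)["Table"]
        params = {
            "TableName": table_name,
            "KeySchema": desc["KeySchema"],
            "AttributeDefinitions": desc["AttributeDefinitions"],
            "ProvisionedThroughput": {
                "ReadCapacityUnits": desc["ProvisionedThroughput"]["ReadCapacityUnits"],
                "WriteCapacityUnits": desc["ProvisionedThroughput"]["WriteCapacityUnits"],
            },
        }
        if desc.get("GlobalSecondaryIndexes"):
            params["GlobalSecondaryIndexes"] = [
                {
                    "IndexName": gsi["IndexName"],
                    "KeySchema": gsi["KeySchema"],
                    "Projection": gsi["Projection"],
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": gsi["ProvisionedThroughput"]["ReadCapacityUnits"],
                        "WriteCapacityUnits": gsi["ProvisionedThroughput"]["WriteCapacityUnits"],
                    },
                }
                for gsi in desc["GlobalSecondaryIndexes"]
            ]
        # Local secondary indexes, streams, TTL, tags, etc. would need the same treatment.
        dynamodb.delete_table(TableName=table_name)
        dynamodb.get_waiter("table_not_exists").wait(TableName=table_name)
        dynamodb.create_table(**params)
        dynamodb.get_waiter("table_exists").wait(TableName=table_name)

    recreate_table("my-table")   # placeholder table name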

ShellCommandActivity in AWS Data Pipeline

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-23 04:46:02
Question: I am transferring DynamoDB data to S3 using Data Pipeline. In the S3 bucket I get the backup, but it is split into multiple files. To get the data into a single file I used a ShellCommandActivity which runs the following command: aws s3 cat #{myOutputS3Loc}/#{format(@scheduledStartTime,'YYYY-MM-dd')}/* > #{myRenamedFile} This should concatenate all the files present in the S3 folder into a single file named #{myRenamedFile} . But I get the following error in Data Pipeline: usage: aws [options]
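There is no aws s3 cat subcommand, which is why the CLI responds with its usage text. One way to do the concatenation is with boto3 instead of a shell one-liner; the bucket, prefix, and destination key below are placeholders.

    import boto3

    def merge_prefix(bucket, prefix, dest_key, scratch="/tmp/merged.out"):
        """Concatenate every object under s3://bucket/prefix into a single object."""
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        with open(scratch, "wb") as merged:
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for obj in page.get("Contents", []):
                    merged.write(s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read())
        s3.upload_file(scratch, bucket, dest_key)

    merge_prefix("my-backup-bucket", "2019-12-23/", "2019-12-23/merged-backup")  # placeholders

If staying in the shell is preferable, aws s3 cp <s3-uri> - streams a single object to stdout, so looping over the listed objects and redirecting into one file achieves the same result.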

Amazon Redshift - Unload to S3 - Dynamic S3 file name

拜拜、爱过 submitted on 2019-12-23 03:10:00
Question: I have been using the UNLOAD statement in Redshift for a while now; it makes it easy to dump a file to S3 and then let people analyse it. The time has come to try to automate it. We have Amazon Data Pipeline running for several tasks, and I wanted to run a SqlActivity to execute UNLOAD automatically. I use a SQL script hosted in S3. The query itself is correct, but what I have been trying to figure out is how I can dynamically assign the name of the file. For example: UNLOAD('<the_query>') TO
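One way to get a date-stamped file name is to build the UNLOAD text outside the SQL script and execute it over an ordinary Postgres-protocol connection, so the prefix comes from Python's date formatting. A sketch with placeholder cluster endpoint, credentials, IAM role, query, and bucket:

    import datetime

    import psycopg2

    # Placeholders: swap in your cluster endpoint, credentials, role ARN, bucket, and query.
    PREFIX = "s3://my-bucket/unloads/{:%Y-%m-%d}/export_".format(datetime.date.today())
    QUERY = "SELECT * FROM analytics.daily_sales"

    unload_sql = """
        UNLOAD ('{query}')
        TO '{prefix}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
        DELIMITER ',' ADDQUOTES GZIP ALLOWOVERWRITE;
    """.format(query=QUERY.replace("'", "''"), prefix=PREFIX)

    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="pipeline_user", password="...")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(unload_sql)
    conn.close()

If the job must stay inside SqlActivity, the pipeline expression language seen elsewhere on this page (e.g. #{format(@scheduledStartTime,'YYYY-MM-dd')}) is the other place to derive the date from.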

Automatic AWS DynamoDB to S3 export failing with “role/DataPipelineDefaultRole is invalid”

こ雲淡風輕ζ submitted on 2019-12-22 04:01:26
Question: Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed, but it fails to run. It seems it has trouble identifying/running an EC2 resource to do the export. When I access EMR through the AWS Console, I see entries like this: Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04 Terminated with errors: EMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is
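This error is commonly reported when DataPipelineDefaultRole is missing either the AWS-managed policies or the EMR trust relationship. A sketch of one way to inspect and repair it with boto3; the policy ARNs and service principals below are the AWS-managed defaults as I understand them, so verify them in your own account before running anything.

    import json

    import boto3

    iam = boto3.client("iam")

    # Allow both Data Pipeline and EMR to assume the service role; a missing
    # elasticmapreduce.amazonaws.com principal is a typical cause of this symptom.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": ["datapipeline.amazonaws.com",
                                      "elasticmapreduce.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.update_assume_role_policy(RoleName="DataPipelineDefaultRole",
                                  PolicyDocument=json.dumps(trust_policy))
    iam.attach_role_policy(
        RoleName="DataPipelineDefaultRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSDataPipelineRole",
    )
    iam.attach_role_policy(
        RoleName="DataPipelineDefaultResourceRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforDataPipelineRole",
    )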

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

风格不统一 submitted on 2019-12-18 06:20:11
Question: I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP , etc. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster , versus amiVersion . When I use "releaseLabel": "emr-4.1.0" , I get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask Below is my data pipeline definition, for EMR
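A sketch of the EmrCluster-related objects for a 4.x definition, written as Python dicts in the CLI-style layout: releaseLabel replaces amiVersion, the applications are listed explicitly, and hive.execution.engine is pinned back to mr through an EmrConfiguration/Property pair, which is a commonly suggested workaround for the TezTask failure. Field names follow the Data Pipeline object reference; treat the concrete values as assumptions to adapt to your own pipeline.

    # Objects as they would appear in a CLI-style definition's "objects" list.
    emr_cluster = {
        "id": "EmrClusterForHive",
        "type": "EmrCluster",
        "releaseLabel": "emr-4.1.0",           # replaces the old amiVersion field
        "applications": ["hive", "pig"],        # on 4.x+ applications must be listed explicitly
        "configuration": {"ref": "HiveSiteConfiguration"},
        "masterInstanceType": "m3.xlarge",
        "coreInstanceType": "m3.xlarge",
        "coreInstanceCount": "2",
    }

    hive_site = {
        "id": "HiveSiteConfiguration",
        "type": "EmrConfiguration",
        "classification": "hive-site",
        "property": [{"ref": "HiveExecEngine"}],
    }

    hive_exec_engine = {
        "id": "HiveExecEngine",
        "type": "Property",
        "key": "hive.execution.engine",
        "value": "mr",                          # fall back to MapReduce if Tez misbehaves
    }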

Parallelization of sklearn Pipeline

匆匆过客 submitted on 2019-12-14 02:06:26
Question: I have a set of Pipelines and want to have a multi-threaded architecture. My typical Pipeline is shown below: huber_pipe = Pipeline([ ("DATA_CLEANER", DataCleaner()), ("DATA_ENCODING", Encoder(encoder_name='code')), ("SCALE", Normalizer()), ("FEATURE_SELECTION", huber_feature_selector), ("MODELLING", huber_model) ]) Is it possible to run the steps of the pipeline in different threads or cores? Answer 1: In general, no. If you look at the interface for sklearn stages, the methods are of the form: fit
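Because each step of a single Pipeline consumes the output of the previous one, the available parallelism is across independent pipelines (or inside individual estimators via their own n_jobs). A sketch using joblib, with stand-in data and estimators in place of the custom DataCleaner/Encoder steps from the question:

    from joblib import Parallel, delayed
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor, Ridge
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    def fit_one(name, pipeline, X, y):
        # Steps inside a single Pipeline still run sequentially; only whole
        # pipelines are independent enough to be fitted in parallel.
        return name, pipeline.fit(X, y)

    if __name__ == "__main__":
        X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # stand-in data
        pipelines = {
            "huber": Pipeline([("scale", StandardScaler()), ("model", HuberRegressor())]),
            "ridge": Pipeline([("scale", StandardScaler()), ("model", Ridge())]),
        }
        fitted = dict(
            Parallel(n_jobs=-1)(
                delayed(fit_one)(name, pipe, X, y) for name, pipe in pipelines.items()
            )
        )
        print({name: pipe.score(X, y) for name, pipe in fitted.items()})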