Run a python script via AWS Data Pipelines

Submitted by 柔情痞子 on 2019-12-02 06:24:51

Question


I use AWS Data Pipelines to run nightly SQL queries that populate tables for summary statistics. The UI's a bit funky, but eventually I got it up and working.

Now I'd like to do something similar with a python script. I have a file that I run every morning on my laptop (forecast_rev.py) but of course that means I have to turn on my laptop and kick this off every day. Surely I can schedule a Pipeline to do the same thing, and thus go away on vacation and not care.

For the life of me, I can't find a tutorial, AWS doc, or StackOverflow answer about this! I'm not even sure how to get started. Does anyone have a simple pipeline they'd be willing to share steps on?


Answer 1:


I faced a similar situation; here is how I overcame it.
I am going to describe how I did it with Ec2Resource. If you are looking for a solution with EMRCluster, refer to @franklinsijo's answer.

Steps
1. Store your Python script in S3.
2. Create a shell script (hello.sh, given below) and store it in S3.
3. Create an Ec2Resource node and a ShellCommandActivity node and provide the following information:

  • Provide the shell script's S3 URL in the "Script Uri" field of the ShellCommandActivity and set "stage" to true. The activity should run on your DefaultResource (the Ec2Resource). A boto3 sketch of these objects is shown after hello.sh below.

Here is the shell script (hello.sh), which downloads your Python program from S3, stores it locally, installs pip and the required third-party libraries, and finally executes your Python file.

hello.sh

echo 'Download python file to local temp'
aws s3 cp s3://path/to/python_file/hello_world.py /tmp/hello.py
# Install pip (on Amazon Linux / CentOS)
sudo yum -y install python-pip
# Install the third-party libraries your script needs
pip install <dependencies>
# Run the script
python /tmp/hello.py
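
If you prefer to build the pipeline programmatically rather than in the Architect UI, the same objects can be defined with boto3. The following is a minimal sketch under my own assumptions: the pipeline name, bucket paths, instance type, daily schedule, and the default IAM roles are placeholders, not values from this answer; swap in your own before using it.

import boto3

client = boto3.client('datapipeline', region_name='us-east-1')

# 1. Create an empty pipeline
pipeline_id = client.create_pipeline(
    name='nightly-python-pipeline',            # assumed name
    uniqueId='nightly-python-pipeline')['pipelineId']

# 2. Define Default, Schedule, Ec2Resource and ShellCommandActivity objects
objects = [
    {'id': 'Default', 'name': 'Default', 'fields': [
        {'key': 'scheduleType', 'stringValue': 'cron'},
        {'key': 'schedule', 'refValue': 'DailySchedule'},
        {'key': 'failureAndRerunMode', 'stringValue': 'CASCADE'},
        {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
        {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
        {'key': 'pipelineLogUri', 'stringValue': 's3://my-bucket/logs/'}]},    # assumed bucket
    {'id': 'DailySchedule', 'name': 'DailySchedule', 'fields': [
        {'key': 'type', 'stringValue': 'Schedule'},
        {'key': 'period', 'stringValue': '1 days'},
        {'key': 'startDateTime', 'stringValue': '2019-12-03T06:00:00'}]},
    {'id': 'DefaultResource', 'name': 'DefaultResource', 'fields': [
        {'key': 'type', 'stringValue': 'Ec2Resource'},
        {'key': 'instanceType', 'stringValue': 't1.micro'},
        {'key': 'terminateAfter', 'stringValue': '1 Hour'}]},
    {'id': 'RunPythonScript', 'name': 'RunPythonScript', 'fields': [
        {'key': 'type', 'stringValue': 'ShellCommandActivity'},
        {'key': 'runsOn', 'refValue': 'DefaultResource'},
        {'key': 'scriptUri', 'stringValue': 's3://my-bucket/hello.sh'},        # assumed path
        {'key': 'stage', 'stringValue': 'true'}]},
]

# 3. Upload the definition and activate the pipeline
client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
client.activate_pipeline(pipelineId=pipeline_id)

If any field name is off for your setup, exporting the JSON definition of a working pipeline from the console is an easy way to check the exact keys.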

I had a hard time getting the shebang line to work, so I have not included it here.
If the aws s3 cp command doesn't work (the AWS CLI on the resource is too old), here is a quick workaround for that case.

  1. Follow steps 1-3 above, and in addition create an S3DataNode.
    I. Provide your Python script's S3 URL in the "File Path" field of the S3DataNode.
    II. Provide the S3DataNode as "input" to the ShellCommandActivity.
    III. Write the following command in the "Command" field of the ShellCommandActivity.

Command

echo 'Install pip'
sudo yum -y install python-pip
pip install <dependencies>
# With "stage" set to true, the S3DataNode's contents are copied to ${INPUT1_STAGING_DIR}
python ${INPUT1_STAGING_DIR}/hello_world.py
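
For reference, hello_world.py itself can be any ordinary script; the only requirement is that whatever it imports was installed by the pip line above. A trivial, hypothetical example that depends on one third-party package (requests standing in for <dependencies>):

# hello_world.py - hypothetical example, not from the original answer
import datetime

import requests  # third-party package installed by the 'pip install' step

def main():
    # Call a public endpoint and print a timestamped result, just to prove
    # the script and its dependency both run on the Ec2Resource.
    response = requests.get('https://api.github.com')
    print('%s - status %s' % (datetime.datetime.utcnow(), response.status_code))

if __name__ == '__main__':
    main()

Anything the script prints ends up in the activity's stdout log under the pipeline's S3 log location, which is handy for confirming the nightly run actually happened.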



Answer 2:


  1. You need to store your Python script in an S3 bucket.
  2. Create a shell script that installs Python and all your dependencies, copies your Python script from S3 to local storage, and runs it (see the example below).
  3. Store this shell script in S3.
  4. Use a ShellCommandActivity to launch your shell script.

You can use this template as an example: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshiftrdsfull.html It uses a Python script stored on S3 to convert a MySQL schema to a Redshift schema.

Example of a shell script that runs a Python program:

#!/bin/bash
# Download the Python script from S3, then run it on the resource
curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py
python mysql_to_redshift.py


Source: https://stackoverflow.com/questions/43456182/run-a-python-script-via-aws-data-pipelines
