How do you automate PySpark jobs on EMR using boto3 (or otherwise)?

Asked by 悲&欢浪女 on 2021-02-01 08:44

I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database.

My job flow is as follows:

  • Grab the raw server data from S3
  • Parse it with a PySpark job on an EMR cluster (see the sketch below)
  • Upload the parsed results from S3 into Redshift
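
A minimal boto3 sketch of that flow (not code from the question): `run_job_flow` creates a transient EMR cluster that runs a single spark-submit step and terminates when the step finishes. The bucket names, script path, instance types, region, and IAM role names are all placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

response = emr.run_job_flow(
    Name="parse-server-logs",                    # hypothetical job name
    ReleaseLabel="emr-6.15.0",                   # any EMR release that ships Spark
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient cluster: shut down once there are no more steps to run.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "parse-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step invoke spark-submit directly.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://my-bucket/scripts/parse_logs.py",  # hypothetical PySpark script
                "s3://my-bucket/raw/",                   # input prefix (placeholder)
                "s3://my-bucket/parsed/",                # output prefix (placeholder)
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",           # default EMR instance profile
    ServiceRole="EMR_DefaultRole",               # default EMR service role
)
print("Started cluster:", response["JobFlowId"])
```

Loading the parsed output from S3 into Redshift is usually done with a `COPY` statement run against the database rather than an EMR step, so it is left out of the sketch.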

4 answers

  •  情话喂你
     Answered 2021-02-01 08:55

You can do this with AWS Data Pipeline. Set up your S3 bucket to trigger a Lambda function every time a new file lands in the bucket (https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html). The Lambda function then activates your Data Pipeline (https://aws.amazon.com/blogs/big-data/using-aws-lambda-for-event-driven-data-processing-pipelines/). The pipeline spins up a new EMR cluster using an EmrCluster object (where you can specify bootstrap options), runs your EMR commands using an EmrActivity, and when everything is done it terminates the EMR cluster and deactivates the Data Pipeline.
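
A minimal sketch of the Lambda piece of that setup, assuming the Data Pipeline (with its EmrCluster and EmrActivity objects) is already defined and only needs to be activated on each S3 event; the pipeline id is a placeholder:

```python
import boto3

PIPELINE_ID = "df-0123456789ABCDEF"  # hypothetical pipeline id

def lambda_handler(event, context):
    # The S3 notification names the object that just arrived; handy for
    # logging or for filtering out files you don't want to process.
    record = event["Records"][0]["s3"]
    print("New object:", record["bucket"]["name"], record["object"]["key"])

    # Activating the pipeline kicks off a run: Data Pipeline spins up the
    # EmrCluster, executes the EmrActivity, and tears the cluster down when
    # the run completes, as described above.
    datapipeline = boto3.client("datapipeline")
    datapipeline.activate_pipeline(pipelineId=PIPELINE_ID)
```

Because the cluster lifecycle lives in the pipeline definition, the Lambda stays tiny; its execution role only needs the `datapipeline:ActivatePipeline` permission on top of the basic logging permissions.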
