How do you automate PySpark jobs on EMR using boto3 (or otherwise)?

Asked by 悲&欢浪女 on 2021-02-01 08:44

I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database.

My job flow is as follows:

  • Grab the …
4 Answers

  • 伪装坚强ぢ · 2021-02-01 08:56

    I put a complete example on GitHub that shows how to do all of this with Boto3.

    The long-lived cluster example shows how to create a cluster and run job steps on it: the steps grab data from a public S3 bucket that contains historical Amazon review data, do some PySpark processing on it, and write the output back to another S3 bucket. The full demo does the following (a minimal sketch of the core boto3 calls follows the list):

    • Creates an Amazon S3 bucket and uploads a job script.
    • Creates AWS Identity and Access Management (IAM) roles used by the demo.
    • Creates Amazon Elastic Compute Cloud (Amazon EC2) security groups used by the demo.
    • Creates short-lived and long-lived clusters and runs job steps on them.
    • Terminates clusters and cleans up all resources.
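
    For reference, here is a minimal sketch of what that pattern boils down to with boto3. The region, bucket, script path, release label, and instance sizes below are placeholders of my own, not values from the demo, and it assumes the default EMR roles already exist (e.g. created once with `aws emr create-default-roles`):

    ```python
    import boto3

    emr = boto3.client("emr", region_name="us-west-2")  # region is an assumption

    # Hypothetical locations -- substitute your own bucket, script, and log prefix.
    SCRIPT_URI = "s3://my-bucket/scripts/process_reviews.py"
    LOG_URI = "s3://my-bucket/emr-logs/"


    def spark_step(name, script_uri, *script_args):
        """Build an EMR step that runs the given script with spark-submit."""
        return {
            "Name": name,
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR helper jar that runs CLI commands
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_uri, *script_args],
            },
        }


    # Create a long-lived cluster. KeepJobFlowAliveWhenNoSteps=True keeps the
    # cluster running after each step finishes so more steps can be added later.
    cluster = emr.run_job_flow(
        Name="pyspark-demo",
        ReleaseLabel="emr-6.9.0",  # assumed release label; pick a current one
        Applications=[{"Name": "Spark"}],
        LogUri=LOG_URI,
        Instances={
            "MasterInstanceType": "m5.xlarge",  # instance sizes are assumptions
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance-profile role
        ServiceRole="EMR_DefaultRole",      # default EMR service role
        VisibleToAllUsers=True,
    )
    cluster_id = cluster["JobFlowId"]

    # Wait until the cluster is up, then submit a PySpark step and block on it.
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    step_ids = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[spark_step("process-reviews", SCRIPT_URI)],
    )["StepIds"]
    emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_ids[0])

    # Shut the cluster down once all work is done.
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
    ```

    A short-lived cluster is the same `run_job_flow` call with `KeepJobFlowAliveWhenNoSteps=False` and the `Steps` list passed directly to it, in which case EMR terminates the cluster on its own once the steps finish.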
