I am creating a job to parse massive amounts of server data and then load it into a Redshift database.
My job flow is as follows:
Actually, I've gone with AWS Step Functions, which is a state machine wrapper for Lambda functions. You can use boto3 to start the EMR Spark job with run_job_flow, and you can use describe_cluster to get the status of the cluster. Finally, use a Choice state to branch on that status. So your step function looks something like this (state types in brackets; sketches of each piece follow below):
Run job (Task) -> Wait X minutes (Wait) -> Check status (Task) -> Branch (Choice) [ => back to Wait, or => Done ]
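Here's a minimal sketch of the launcher Lambda, assuming a hypothetical cluster configuration (the job name, instance types, script location, and IAM roles are all placeholders you'd replace with your own):

```python
import boto3

emr = boto3.client("emr")

def start_job(event, context):
    """Launch an EMR cluster and submit the Spark job as a step."""
    response = emr.run_job_flow(
        Name="parse-server-logs",        # hypothetical job name
        ReleaseLabel="emr-5.36.0",
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Tear the cluster down once the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "Parse and load",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Hypothetical script location on S3.
                "Args": ["spark-submit", "s3://my-bucket/parse_job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    # Pass the cluster id along the state machine output so the
    # status-check Lambda knows which cluster to poll.
    return {"ClusterId": response["JobFlowId"]}
```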
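And a sketch of the status-check Lambda, assuming the cluster id from the launcher is carried through the state machine input:

```python
import boto3

emr = boto3.client("emr")

def check_status(event, context):
    """Poll the cluster and report its state back to the state machine."""
    cluster_id = event["ClusterId"]
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    # Possible states include STARTING, BOOTSTRAPPING, RUNNING, WAITING,
    # TERMINATING, TERMINATED, and TERMINATED_WITH_ERRORS.
    return {"ClusterId": cluster_id, "State": state}
```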
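The state machine itself is defined in Amazon States Language. Here's one possible definition of the Task -> Wait -> Task -> Choice loop above, built as a Python dict and registered with boto3; the Lambda ARNs, role ARN, and wait time are placeholders:

```python
import json
import boto3

# Hypothetical Lambda ARNs; substitute your own.
START_ARN = "arn:aws:lambda:us-east-1:123456789012:function:start_job"
CHECK_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check_status"

definition = {
    "StartAt": "Run job",
    "States": {
        "Run job": {"Type": "Task", "Resource": START_ARN, "Next": "Wait"},
        # Wait X minutes between polls (here, 5 minutes).
        "Wait": {"Type": "Wait", "Seconds": 300, "Next": "Check status"},
        "Check status": {"Type": "Task", "Resource": CHECK_ARN, "Next": "Branch"},
        "Branch": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.State", "StringEquals": "TERMINATED",
                 "Next": "Done"},
                {"Variable": "$.State", "StringEquals": "TERMINATED_WITH_ERRORS",
                 "Next": "Failed"},
            ],
            # Still starting or running: loop back and wait again.
            "Default": "Wait",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="emr-spark-poller",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
)
```

The Choice state's Default branch is what closes the loop: anything other than a terminal cluster state sends execution back to the Wait state, so you get polling without paying for a Lambda that sleeps.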