I am creating a job to parse massive amounts of server data, and then upload it into a Redshift
database.
My job flow is as follows:
I put a complete example on GitHub that shows how to do all of this with Boto3.
The long-lived cluster example shows how to create a cluster and run job steps on it: the steps grab data from a public S3 bucket that contains historical Amazon review data, do some PySpark processing on it, and write the output back to an S3 bucket.
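As a minimal sketch of that pattern, the snippet below submits a PySpark step to an already-running EMR cluster with Boto3's `add_job_flow_steps` and waits for it to finish. The cluster ID, script location, and bucket paths are placeholders, not values from the GitHub example.

```python
import boto3

# Hypothetical identifiers -- replace with your own cluster ID and S3 paths.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT_URI = "s3://my-bucket/scripts/process_reviews.py"
OUTPUT_URI = "s3://my-bucket/output/"

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark step to the long-lived cluster. command-runner.jar lets EMR
# run spark-submit with the arguments listed in Args.
response = emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "Process Amazon review data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", SCRIPT_URI, "--output_uri", OUTPUT_URI],
            },
        }
    ],
)

step_id = response["StepIds"][0]
print(f"Submitted step {step_id}")

# Optionally block until the step completes before kicking off the next stage
# (for example, a Redshift load from OUTPUT_URI).
waiter = emr.get_waiter("step_complete")
waiter.wait(ClusterId=CLUSTER_ID, StepId=step_id)
```

From there, the written output in S3 can be loaded into Redshift (for example with a COPY command), which matches the overall flow described in the question.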