Question
I am new to AWS and trying to create a transient cluster on AWS EMR to run a Python script. I just want to run the Python script that will process the file and auto-terminate the cluster on completion. I have also created a key pair and specified it.
Command below:
aws emr create-cluster --name "test1-cluster" --release-label emr-5.5.0 --name pyspark_analysis --ec2-attributes KeyName=k-key-pair --applications Name=Hadoop Name=Hive Name=Spark --instance-groups --use-default-roles --instance-type m5-xlarge --instance-count 2 --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ --steps Type=SPARK, Name="pyspark_analysis", ActionOnFailure=CONTINUE, Args=[-deploy-mode,cluster, -master,yarn, -conf,spark.yarn.submit.waitAppCompletion=true, -executor-memory,1g, s3://k-test-bucket-input/word_count.py, s3://k-test-bucket-input/input/a.csv, s3://k-test-bucket-input/output/ ] --auto-terminate
Error message:
zsh: bad pattern: Args=[
What I tried:
I checked the args and the spaces for any accidentally introduced characters, but that does not seem to be the case. My syntax is surely wrong somewhere, but I am not sure what I am missing.
What the command is expected to do:
It is expected to execute word_count.py, reading the input file a.csv and generating the output in b.csv.
Answer 1:
I think the issue is with the use of spaces in --steps. I formatted the command so it is a bit easier to read where the spaces are (or the lack of them):
aws emr create-cluster \
--name "test1-cluster" \
--release-label emr-5.5.0 \
--name pyspark_analysis \
--ec2-attributes KeyName=k-key-pair \
--applications Name=Hadoop Name=Hive Name=Spark \
--instance-groups --use-default-roles \
--instance-type m5-xlarge --instance-count 2 \
--region us-east-1 --log-uri s3://k-test-bucket-input/logs/ \
--steps Type=SPARK,Name="pyspark_analysis",ActionOnFailure=CONTINUE,Args=[-deploy-mode,cluster,-master,yarn,-conf,spark.yarn.submit.waitAppCompletion=true,-executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/] \
--auto-terminate
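For context, the "bad pattern" message comes from zsh itself, not from the AWS CLI: an unquoted [ starts a glob pattern, and the space after Args=[-deploy-mode,cluster, leaves that pattern unclosed, so zsh rejects the word before aws even runs. A minimal shell-only illustration (echo stands in for the real command):
# Unclosed bracket in a word: zsh rejects it before running anything,
# with an error along the lines of "zsh: bad pattern: Args=[...".
echo Args=[-deploy-mode,cluster,

# Even a well-formed [...] is still a glob pattern; with no matching file,
# zsh's default nomatch behaviour reports "no matches found" instead of
# passing the argument through unchanged.
echo Args=[yarn,cluster]

# Quoting the value, or prefixing the command with zsh's noglob modifier,
# leaves the brackets untouched and prints Args=[yarn,cluster] as-is.
echo 'Args=[yarn,cluster]'
noglob echo Args=[yarn,cluster]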
Answer 2:
Try enclosing everything in quotes:
aws emr create-cluster \
--name "test1-cluster" \
--release-label emr-5.5.0 \
--name pyspark_analysis \
--ec2-attributes KeyName=k-key-pair \
--applications Name=Hadoop Name=Hive Name=Spark \
--instance-groups --use-default-roles \
--instance-type m5-xlarge --instance-count 2 \
--region us-east-1 --log-uri s3://k-test-bucket-input/logs/ \
--steps Type="SPARK",Name="pyspark_analysis",ActionOnFailure="CONTINUE",Args=[-deploy-mode,cluster,-master,yarn,-conf,spark.yarn.submit.waitAppCompletion=true,-executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/] \
--auto-terminate
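Since the shell is zsh, the quotes that matter most are the ones covering the Args=[...] list, because the unquoted brackets are what zsh tries to expand. Below is a sketch of the same idea with the entire --steps value in single quotes; it also assumes a few unrelated fixes to the question's command that neither answer addresses: a single --name, the instance type written as m5.xlarge, double dashes on the spark-submit options (--deploy-mode, --master, --conf, --executor-memory), no bare --instance-groups (since --instance-type/--instance-count are used instead), and a newer release label borrowed from the documentation example further down:
aws emr create-cluster \
--name "test1-cluster" \
--release-label emr-5.30.1 \
--applications Name=Hadoop Name=Hive Name=Spark \
--ec2-attributes KeyName=k-key-pair \
--use-default-roles \
--instance-type m5.xlarge --instance-count 2 \
--region us-east-1 \
--log-uri s3://k-test-bucket-input/logs/ \
--steps 'Type=Spark,Name=pyspark_analysis,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/]' \
--auto-terminate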
Visit the docs for more info: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
And yes, Spark can be used:
aws emr create-cluster --name "Add Spark Step Cluster" --release-label emr-5.30.1 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 \
--steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10] --use-default-roles
Source: https://stackoverflow.com/questions/62928662/facing-error-while-trying-to-create-transient-cluster-on-aws-emr-to-run-python-s