Question
I am running Spark jobs on an EMR cluster. The issue I am facing is that all the EMR jobs triggered are executed as steps, one after another (in a queue).
Is there any way to make them run in parallel? If not, is there a workaround for that?
Answer 1:
Elastic MapReduce comes by default with a YARN setup that is very "step" oriented: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes the cluster usage for that single job, granting it all available resources until it finishes.
Running multiple concurrent jobs in an EMR cluster (or, in fact, any other YARN-based Hadoop cluster) requires a proper YARN setup with multiple queues so that resources are granted to each job appropriately. YARN's documentation covers all of the Capacity Scheduler features quite well, and it is simpler than it sounds.
YARN's FairScheduler is quite popular but it uses a different approach and may be a bit more difficult to configure depending on your needs. Given the simplest scenario where you have a single Fair queue, YARN will try to grant containers to waiting jobs as soon as they are freed by running jobs, ensuring that all the jobs submitted to a cluster get at least a fraction of compute resources as soon as they are available.
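If you want to experiment with the FairScheduler on EMR, one option is to override yarn.resourcemanager.scheduler.class through an EMR configuration classification at cluster launch. Below is a minimal, hedged sketch using boto3; the cluster name, release label, instance types, roles, and region are placeholders, not values taken from the question.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

# Hypothetical cluster launch: the only part relevant here is the
# "yarn-site" classification that switches YARN to the FairScheduler.
response = emr.run_job_flow(
    Name="parallel-jobs-demo",          # placeholder name
    ReleaseLabel="emr-5.30.0",          # placeholder release label
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Configurations=[
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.scheduler.class":
                    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
            },
        }
    ],
)
print(response["JobFlowId"])
```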
Answer 2:
If you are concerned about YARN jobs (submitted by Spark) waiting in a queue:
There are multiple ways to run jobs in parallel.
By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and has one single DEFAULT queue where all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can RUN (not just submit) in parallel really depends on how many ApplicationMasters (AMs), mappers, and reducers your EMR cluster can run at the same time.
For example: you have a cluster that can run at most 10 mappers in parallel (see the question "AWS EMR Parallel Mappers?").
Suppose you submit 2 map-only jobs, each requiring 10 mappers, one after another. The first job takes up all the mapper container capacity and runs, while the second waits in the queue for containers to free up. The same behavior applies to AMs and reducers.
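As a back-of-the-envelope illustration of that container limit (all numbers below are made-up assumptions, not values from the question), YARN can run roughly nodes * (node memory / container memory) containers at once, so a second 10-mapper job cannot fit on a cluster that caps out at 10 mappers:

```python
# Rough capacity estimate; all numbers are illustrative assumptions.
core_nodes = 5
node_memory_mb = 12288      # yarn.nodemanager.resource.memory-mb per node (assumed)
mapper_memory_mb = 6144     # mapreduce.map.memory.mb per map container (assumed)

max_parallel_mappers = core_nodes * (node_memory_mb // mapper_memory_mb)
print(max_parallel_mappers)  # 10 -> a second 10-mapper job has to wait in the queue
```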
Now, to make them run in parallel in spite of that limitation on the number of containers the cluster supports:
Keeping the CapacityScheduler, you can create multiple queues, configuring a capacity percentage and a maximum capacity for each queue, so that a job in the first queue cannot use up all containers even if it needs them. You can then submit your second job to the second queue, which has its own pre-determined capacity (see the sketch below).
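A hedged sketch of what that CapacityScheduler setup could look like as an EMR "capacity-scheduler" classification (pass it in the Configurations list of run_job_flow, as in the earlier sketch). The queue names and percentages are arbitrary examples, not recommended values:

```python
# Example "capacity-scheduler" classification: two queues under root,
# each with a capacity share and a maximum-capacity cap so neither
# queue can monopolize the cluster. Names and numbers are illustrative.
capacity_scheduler_config = {
    "Classification": "capacity-scheduler",
    "Properties": {
        "yarn.scheduler.capacity.root.queues": "default,second",
        "yarn.scheduler.capacity.root.default.capacity": "60",
        "yarn.scheduler.capacity.root.default.maximum-capacity": "80",
        "yarn.scheduler.capacity.root.second.capacity": "40",
        "yarn.scheduler.capacity.root.second.maximum-capacity": "80",
    },
}
```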
Alternatively, you can use the FairScheduler by configuring yarn-site.xml. The FairScheduler allows you to configure queues and share resources across those queues fairly. You can also use the PREEMPTION option of the FairScheduler. In either case, you point each job at a specific queue when submitting it, as shown below.
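Once the queues exist, each Spark job has to be submitted to a named queue, otherwise everything still lands in the default one. A minimal sketch, assuming spark-submit is on the PATH, the "second" queue from the sketch above exists, and my_job.py is a placeholder application:

```python
import subprocess

# Submit a Spark application to a specific YARN queue.
# "second" is the example queue name from the sketch above; my_job.py is a placeholder.
subprocess.run(
    [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", "second",   # equivalent to --conf spark.yarn.queue=second
        "my_job.py",
    ],
    check=True,
)
```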
Note that which option to go with really depends on your use case and business needs. It is important to learn about all the options and their possible impact.
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html
Answer 3:
Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.
Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
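On an existing cluster, the step concurrency level can be raised through the EMR API. A minimal sketch with boto3; the region, cluster ID, and concurrency value are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

# Allow up to 5 steps to run at the same time on an existing cluster.
# "j-XXXXXXXXXXXXX" is a placeholder cluster ID.
emr.modify_cluster(
    ClusterId="j-XXXXXXXXXXXXX",
    StepConcurrencyLevel=5,
)
```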
Source: https://stackoverflow.com/questions/43121382/running-steps-of-emr-in-parallel