Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

说谎 2020-12-04 23:27

I am running the Kinesis plus Spark application from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

I am running it as below, with the command on an EC2 instance:

13 Answers
  • 2020-12-05 00:17

    There are three ways we can try to fix this issue.

    1. Check for Spark processes on your machine and kill them.

    Do

    ps aux | grep spark
    

    Note the process IDs of all the Spark processes and kill them, e.g.

    sudo kill -9 4567 7865
    
    2. Check the number of Spark applications running on your cluster.

    To check this, do

    yarn application -list
    

    You will get output similar to this:

    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                    Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
    application_1496703976885_00567       ta da                SPARK        cloudera       default             RUNNING           UNDEFINED              20%             http://10.0.52.156:9090
    

    Check the application IDs: if there is more than one or two, kill the extras. Your cluster may not be able to run more than two Spark applications at the same time. I am not 100% sure about this, but if you run more than two Spark applications on the cluster, it starts complaining. Kill them like this:

    yarn application -kill application_1496703976885_00567
    
    3. Check your Spark config parameters. For example, if you have requested more executor memory, driver memory, or executors than the cluster can provide, that may also cause this issue. Reduce any of them and run your Spark application again; that might resolve it (see the sketch below).
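
    For illustration, a spark-submit call with deliberately small resource requests might look something like this (the class name and JAR are placeholders, not from the question):

    # class and JAR below are placeholders -- substitute your own application
    spark-submit --master yarn-cluster \
      --driver-memory 512m \
      --executor-memory 512m \
      --num-executors 1 \
      --class com.example.MyStreamingApp my-streaming-app.jar
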
  • 2020-12-05 00:20

    I faced the same issue in the Cloudera QuickStart VM when I tried to launch the pyspark shell. When I looked at the job logs in the ResourceManager, I saw:

    17/02/18 22:20:53 ERROR yarn.ApplicationMaster: Failed to connect to driver at RM IP. 
    

    That means the job is not able to connect to the RM (ResourceManager), because by default pyspark tries to launch in YARN mode in the Cloudera VM.

    pyspark --master local 
    

    worked for me. Restarting the RM (ResourceManager) also resolved the issue.
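
    If the ResourceManager needs a (re)start on the QuickStart VM, something like the following should work, assuming the same service name used elsewhere in this thread:

    sudo service hadoop-yarn-resourcemanager restart
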

    Thanks

  • 2020-12-05 00:22

    This suggests that YARN cannot assign resources to the new app you are submitting. Try reducing the resources you are requesting for the container (see here), or try this on a less busy cluster.
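
    As a quick sanity check (generic YARN commands, not something from the original answer), you can see what capacity the cluster actually has before reducing your request:

    # NodeManagers currently registered with the ResourceManager
    yarn node -list

    # per-node detail, including memory/vcores used and available
    # (<node-id> is a placeholder taken from the list above)
    yarn node -status <node-id>
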

    Another thing to try is to check whether YARN is working properly as a service:

    sudo service hadoop-yarn-nodemanager status
    sudo service hadoop-yarn-resourcemanager status
    
  • 2020-12-05 00:22

    In my case, some old Spark processes (stopped with Ctrl+Z) were still running, and their ApplicationMasters (drivers) were probably still occupying memory. So the new ApplicationMaster from the new spark command may wait indefinitely to get registered by the YarnScheduler, because spark.driver.memory cannot be allocated on the respective core nodes. This can also occur when maximum resource allocation is enabled and the driver is set to use the maximum resources available on a core node.

    So I identified all those stale Spark client processes and killed them (which presumably killed their drivers and released memory).

    ps aux | grep spark
    
    hadoop    3435  1.4  3.0 3984908 473520 pts/1  Tl   Feb17   0:12  .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10
    
    hadoop   32630  0.9  3.0 3984908 468928 pts/1  Tl   Feb17   0:14 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 1000
    
        kill -9 3435 32630
    

    After that, I no longer saw those messages.
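
    If there are many such stale clients, a generic one-liner along these lines can collect their PIDs and kill them in one go (this is a sketch, not part of the original answer; check what it matches before running it):

    # the [S] trick stops grep from matching itself; xargs -r skips the kill if nothing matches
    ps aux | grep '[S]parkSubmit' | awk '{print $2}' | xargs -r kill -9
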

  • 2020-12-05 00:22

    Had a similar problem

    As other answers here indicate, it's a resource availability issue.

    In my case, I was running an ETL process where the old data from the previous run was trashed each time. However, the newly trashed data was being stored in the controlling user's /user/myuser/.Trash folder. Looking at the Ambari dashboard, I could see that overall HDFS disk usage was near capacity, which was causing the resource issues.

    So in this case, I used the -skipTrash option with hadoop fs -rm ... on the old data files. Otherwise they take up space in the trash roughly equivalent to the size of all data stored in the ETL storage dir, effectively doubling the total space used by the application and causing the resource problems.
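
    For illustration, the cleanup could look something like this (the path is a made-up example, not from the original answer):

    # delete old ETL output without moving it to .Trash
    hadoop fs -rm -r -skipTrash /user/myuser/old_etl_output

    # and/or purge trash checkpoints older than the fs.trash.interval retention period
    hadoop fs -expunge
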

  • 2020-12-05 00:24

    I am on a slightly different setup, using CDH 5.4. I think the cause of this issue on my setup is something getting stuck because of an error (file already exists, etc.), since it happens after some other part of my code errors out and I try to fix it and kick it off again.

    I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it's probably due to resources still being allocated to something that errored out; you need to reclaim those resources to be able to run again, or allocate them differently to begin with.

    For example, my cluster has 4 executors available. In the SparkConf for one process I set spark.executor.instances to 4. While that process is still running, potentially hung up for some reason, I kick off another job (either the same way, or with spark-submit) with spark.executor.instances set to 1 ("--num-executors 1" if using spark-submit). I only have 4 executors, and all 4 are allocated to the earlier process, so the new one asking for 1 executor has to wait in line.
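
    To make that scenario concrete, a rough sketch with placeholder class and JAR names (not the actual jobs from this answer):

    # first job: requests all 4 executors the cluster has
    spark-submit --master yarn-cluster --num-executors 4 --class com.example.JobA job-a.jar

    # second job: asks for only 1 executor, but none are free, so its application
    # sits in the ACCEPTED state until the first job finishes or is killed
    spark-submit --master yarn-cluster --num-executors 1 --class com.example.JobB job-b.jar
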
