I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html).
I am running it with the command below on an EC2 instance.
There are three ways we can try to fix this issue.
Do
ps aux | grep spark
Take all the process IDs of the Spark processes and kill them, like:
sudo kill -9 4567 7865
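If there are many of them, something like this one-liner (just a sketch; it targets every process whose command line contains "spark", so review the ps output first) does both steps at once:
ps aux | grep '[s]park' | awk '{print $2}' | xargs -r sudo kill -9   # [s]park keeps grep from matching itself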
To check this, do
yarn application -list
You will get output similar to this:
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1496703976885_00567 ta da SPARK cloudera default RUNNING UNDEFINED 20% http://10.0.52.156:9090
Check the application IDs: if there is more than one application (or more than two), kill them. Your cluster cannot run more than two Spark applications at the same time. I am not 100% sure about this, but on a cluster, if you run more than two Spark applications it will start complaining. So kill them like this:
yarn application -kill application_1496703976885_00567
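If several stuck applications are listed, a small loop like this (only a sketch; it kills every application ID that -list prints, so double-check nothing important is running) cleans them all up:
for app in $(yarn application -list 2>/dev/null | awk '/application_/ {print $1}'); do
  yarn application -kill "$app"   # kill each listed application ID
done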
I faced the same issue in the Cloudera quickstart VM when I tried to execute the pyspark shell. When I looked at the job logs in the ResourceManager, I saw:
17/02/18 22:20:53 ERROR yarn.ApplicationMaster: Failed to connect to driver at RM IP.
That means the job is not able to connect to the RM (ResourceManager), because by default pyspark tries to launch in YARN mode in the Cloudera VM.
pyspark --master local
worked for me. Even starting the RM resolved the issue.
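For reference, on the quickstart VM starting it looked something like this for me (same service names as the status checks in another answer):
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start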
Thanks
This suggests that YARN cannot assign resources for the new App you are submitting. Try to reduce the resources for the container you are asking for (see here), or try this on a less busy cluster.
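As a rough sketch of what a smaller request could look like (the memory values and your-app.jar are placeholders to adjust for your job):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  --num-executors 1 \
  your-app.jar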
Another thing to try is check if YARN works properly as a service:
sudo service hadoop-yarn-nodemanager status
sudo service hadoop-yarn-resourcemanager status
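If both services are up, you can also get a quick idea of how busy the cluster is, for example:
yarn node -list   # shows each NodeManager and the number of running containers
On a default setup, the ResourceManager web UI at http://<rm-host>:8088 also shows used vs. available memory and vcores.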
In my case, I saw that some old Spark processes (stopped with Ctrl+Z) were still running, and their ApplicationMasters (drivers) were probably still occupying memory. So the new ApplicationMaster from the new spark command may wait indefinitely to get registered by the YarnScheduler, because spark.driver.memory cannot be allocated on the respective core nodes. This can also happen when maximum resource allocation is true and the driver is set to use the maximum resources available on a core node.
So I identified all those stale Spark client processes and killed them (which likely killed their drivers and released the memory):
ps -aux | grep spark
hadoop 3435 1.4 3.0 3984908 473520 pts/1 Tl Feb17 0:12 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10
hadoop 32630 0.9 3.0 3984908 468928 pts/1 Tl Feb17 0:14 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 1000
kill -9 3435 32630
After that, I did not see those messages anymore.
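A side note on the Ctrl+Z case: if the suspended processes still belong to your current shell, you can also find and kill them by job number, roughly like this:
jobs -l        # lists suspended/background jobs of this shell with their PIDs
kill -9 %1     # kill job number 1 (or use the PID shown by jobs -l)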
I had a similar problem.
Like other answers here indicate, it is a resource availability issue.
In my case, I was running an ETL process where the old data from the previous run was trashed each time. However, the newly trashed data was being stored in the controlling user's /user/myuser/.Trash folder. Looking at the Ambari dashboard, I could see that overall HDFS disk usage was near capacity, which was causing the resource issues.
So in this case, I used the -skipTrash option of hadoop fs -rm ... on the old data files. Otherwise the trash takes up space roughly equivalent to the size of all the data stored in the ETL storage dir, effectively doubling the total space used by the application and causing the resource problems.
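Roughly what that looked like (the output path is a placeholder for my ETL directory; /user/myuser/.Trash is the trash folder mentioned above):
hadoop fs -du -s -h /user/myuser/.Trash                 # how much space the trash is using
hadoop fs -rm -r -skipTrash /path/to/old/etl/output     # placeholder path: remove without going to trash
hadoop fs -expunge                                      # empty anything already sitting in the trash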
I am on a slightly different setup using CDH 5.4. I think the cause of this issue on my setup is something getting stuck because of an error (file already exists, etc.), because this happens after some other part of my code errors out and I try to fix it and kick it off again.
I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it's probably due to resources being allocated to something that errored out, and you need to reclaim those resources to be able to run again, or allocate them differently to begin with.
For example, my cluster has 4 executors available to it. In SparkConf for one process, I set spark.executor.instances to 4. While that process is still running, potentially hung up for some reason, I kick off another job (either the same way, or with spark-submit) with spark.executor.instances set to 1 ("--num-executors 1" if using spark-submit). I only have 4, and 4 are allocated to the earlier process, so this one, which is asking for 1 executor, has to wait in line.
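In that situation the second job usually just sits in the ACCEPTED state. Assuming your yarn CLI supports -appStates (recent Hadoop 2.x does), you can confirm it with:
yarn application -list -appStates ACCEPTED,RUNNING
If the new application shows up as ACCEPTED while the old one is RUNNING and holding all the executors, killing or finishing the old one frees up the slot.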