I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html).
I am running it with the command below on an EC2 instance.
There are three ways we can try to fix this issue.
Do
ps aux | grep spark
Take all the process IDs of the Spark processes and kill them, like:
sudo kill -9 4567 7865
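If there are many of them, something like this one-liner (just a sketch; it targets every process whose command line contains "spark", so review the ps output first) does both steps at once:
ps aux | grep '[s]park' | awk '{print $2}' | xargs -r sudo kill -9   # [s]park keeps grep from matching itself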
To check this, do
yarn application -list
You will get output similar to this:
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1496703976885_00567 ta da SPARK cloudera default RUNNING UNDEFINED 20% http://10.0.52.156:9090
Check the application IDs: if there is more than one application (or more than two), kill them. Your cluster cannot run more than two Spark applications at the same time. I am not 100% sure about this, but on a cluster, if you run more than two Spark applications it will start complaining. So kill them like this:
yarn application -kill application_1496703976885_00567
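If several stuck applications are listed, a small loop like this (only a sketch; it kills every application ID that -list prints, so double-check nothing important is running) cleans them all up:
for app in $(yarn application -list 2>/dev/null | awk '/application_/ {print $1}'); do
  yarn application -kill "$app"   # kill each listed application ID
done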
I faced the same issue in the Cloudera quickstart VM when I tried to execute the pyspark shell. When I looked at the job logs in the ResourceManager, I saw:
17/02/18 22:20:53 ERROR yarn.ApplicationMaster: Failed to connect to driver at RM IP.
That means the job is not able to connect to the RM (ResourceManager), because by default pyspark tries to launch in YARN mode in the Cloudera VM.
pyspark --master local
worked for me. Even starting the RM resolved the issue.
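For reference, on the quickstart VM starting it looked something like this for me (same service names as the status checks in another answer):
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start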
Thanks
This suggests that YARN cannot assign resources for the new App you are submitting. Try to reduce the resources for the container you are asking for (see here), or try this on a less busy cluster.
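As a rough sketch of what a smaller request could look like (the memory values and your-app.jar are placeholders to adjust for your job):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  --num-executors 1 \
  your-app.jar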
Another thing to try is check if YARN works properly as a service:
sudo service hadoop-yarn-nodemanager status
sudo service hadoop-yarn-resourcemanager status
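If both services are up, you can also get a quick idea of how busy the cluster is, for example:
yarn node -list   # shows each NodeManager and the number of running containers
On a default setup, the ResourceManager web UI at http://<rm-host>:8088 also shows used vs. available memory and vcores.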
In my case, I saw that some old Spark processes (stopped with Ctrl+Z) were still running, and their ApplicationMasters (drivers) were probably still occupying memory. So the new ApplicationMaster from the new spark command may wait indefinitely to get registered by the YarnScheduler, because spark.driver.memory cannot be allocated on the respective core nodes. This can also happen when maximum resource allocation is true and the driver is set to use the maximum resources available on a core node.
So I identified all those stale Spark client processes and killed them (which likely killed their drivers and released the memory):
ps -aux | grep spark
hadoop 3435 1.4 3.0 3984908 473520 pts/1 Tl Feb17 0:12 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 10
hadoop 32630 0.9 3.0 3984908 468928 pts/1 Tl Feb17 0:14 .. org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=1G --class org.apache.spark.examples.SparkPi /usr/lib/spark/lib/spark-examples.jar 1000
kill -9 3435 32630
After that, I did not see those messages anymore.
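A side note on the Ctrl+Z case: if the suspended processes still belong to your current shell, you can also find and kill them by job number, roughly like this:
jobs -l        # lists suspended/background jobs of this shell with their PIDs
kill -9 %1     # kill job number 1 (or use the PID shown by jobs -l)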
I had a similar problem.
Like other answers here indicate, it is a resource availability issue.
In my case, I was running an ETL process where the old data from the previous run was trashed each time. However, the newly trashed data was being stored in the controlling user's /user/myuser/.Trash folder. Looking at the Ambari dashboard, I could see that overall HDFS disk usage was near capacity, which was causing the resource issues.
So in this case, I used the -skipTrash option of hadoop fs -rm ... on the old data files. Otherwise the trash takes up space roughly equivalent to the size of all the data stored in the ETL storage dir, effectively doubling the total space used by the application and causing the resource problems.
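Roughly what that looked like (the output path is a placeholder for my ETL directory; /user/myuser/.Trash is the trash folder mentioned above):
hadoop fs -du -s -h /user/myuser/.Trash                 # how much space the trash is using
hadoop fs -rm -r -skipTrash /path/to/old/etl/output     # placeholder path: remove without going to trash
hadoop fs -expunge                                      # empty anything already sitting in the trash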
I am on a slightly different setup using CDH 5.4. I think the cause of this issue on my setup is something getting stuck because of an error (file already exists, etc.), because this happens after some other part of my code errors out and I try to fix it and kick it off again.
I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it's probably due to resources being allocated to something that errored out, and you need to reclaim those resources to be able to run again, or allocate them differently to begin with.
For example, my cluster has 4 executors available to it. In SparkConf for one process, I set spark.executor.instances to 4. While that process is still running, potentially hung up for some reason, I kick off another job (either the same way, or with spark-submit) with spark.executor.instances set to 1 ("--num-executors 1" if using spark-submit). I only have 4, and 4 are allocated to the earlier process, so this one, which is asking for 1 executor, has to wait in line.
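In that situation the second job usually just sits in the ACCEPTED state. Assuming your yarn CLI supports -appStates (recent Hadoop 2.x does), you can confirm it with:
yarn application -list -appStates ACCEPTED,RUNNING
If the new application shows up as ACCEPTED while the old one is RUNNING and holding all the executors, killing or finishing the old one frees up the slot.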