I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html).
I am running it as below, with the command issued on an EC2 instance.
I had a small cluster where the resources were limited (~3 GB per node). I solved this problem by changing the minimum memory allocation to a sufficiently low number.
From:
yarn.scheduler.minimum-allocation-mb: 1g
yarn.scheduler.increment-allocation-mb: 512m
To:
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
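For reference, on a plain Hadoop install these keys are usually set in yarn-site.xml, with the values given as plain megabyte numbers; increment-allocation-mb is, as far as I know, specific to the Fair Scheduler. A minimal sketch of the change:

<!-- yarn-site.xml: allow YARN to grant containers as small as 256 MB -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>
<!-- Fair Scheduler: round container requests up in 256 MB steps -->
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>256</value>
</property>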
I had the same problem on a local Hadoop cluster with Spark 1.4 and YARN while trying to run spark-shell. It had more than enough resources.
What helped was running the same thing from an interactive LSF job on the cluster. So perhaps there were some network limitations to running YARN from the head node...
When running with yarn-cluster, all the application logging and stdout end up in the assigned YARN application master and will not appear in spark-submit's output. Also, being a streaming application, it usually does not exit. Check the Hadoop ResourceManager web interface and look at the Spark web UI and logs, which are available from the Hadoop UI.
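If log aggregation is enabled, the same container logs can also be pulled from the command line once you know the application id (the id below is a placeholder):

yarn logs -applicationId <application_id>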
In one instance, I had this issue because I was asking for too many resources. This was on a small standalone cluster. The original command was
spark-submit --driver-memory 4G --executor-memory 7G --class "my.class" --master yarn --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=<value> my.jar
I succeeded in getting past 'Accepted' and into 'Running' by changing to
spark-submit --driver-memory 1G --executor-memory 3G --class "my.class" --master yarn --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=<value> my.jar
In other instances, I had this problem because of the way the code was written. We instantiated the Spark context inside the class where it was used, and it never got closed. We fixed the problem by instantiating the context first, passing it to the class where the data is parallelized etc., and then closing the context (sc.close()) in the caller class.
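A rough Scala sketch of that shape (class and method names are hypothetical, and I use stop(), the SparkContext shutdown call in the Scala API):

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // Create the context once in the caller, not inside the class that uses it.
    val sc = new SparkContext(new SparkConf().setAppName("my-app"))
    try {
      new MyJob(sc).run()
    } finally {
      // Shut the context down so YARN releases the application's containers.
      sc.stop()
    }
  }
}

class MyJob(sc: SparkContext) {
  def run(): Unit = {
    val rdd = sc.parallelize(1 to 100)
    println(rdd.count())
  }
}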
I hit the same problem on an MS Azure HDInsight Spark cluster.
I finally found out that the issue was the cluster not being able to talk back to the driver.
I assume you used client mode when submitting the job, since you can see this debug log.
The reason is that Spark executors have to talk to the driver program, and the TCP connection has to be bi-directional. So if your driver program is running in a VM (e.g. an EC2 instance) that is not reachable via hostname or IP (which you have to specify in the Spark conf; it defaults to the hostname), your status will stay ACCEPTED forever.
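If you need to stay in client mode, one option (a sketch; the address, port, class and jar below are placeholders) is to tell Spark explicitly which address and port the executors should dial back to, via the standard spark.driver.host and spark.driver.port settings:

spark-submit --master yarn --deploy-mode client \
  --conf spark.driver.host=<address reachable from the cluster nodes> \
  --conf spark.driver.port=<port open to the cluster> \
  --class my.Class my.jar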
I got this error in this situation:
Logs for container_1453825604297_0001_02_000001 (from ResourceManager web UI):
16/01/26 08:30:38 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
16/01/26 08:31:41 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:44 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:45 ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484)
I worked around it by using yarn cluster mode: MASTER=yarn-cluster.
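With spark-submit in Spark 1.x the equivalent is the following (class and jar names are placeholders; --master yarn --deploy-mode cluster, as in the commands earlier, also works):

spark-submit --master yarn-cluster --class my.Class my.jar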
On another computer which is configured in a similar way, but whose IP is reachable from the cluster, both yarn-client and yarn-cluster work.
Others may encounter this error for different reasons; my point is that checking the error logs (not visible from the terminal, but in the ResourceManager web UI in this case) almost always helps.