org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout

Question

I get the error below from the container when submitting a Spark application to YARN. The Hadoop (2.7.3) / Spark (2.1) environment runs in pseudo-distributed mode on a single-node cluster. The application works perfectly in local mode, but when I tried to verify it in cluster mode with YARN as the resource manager, I hit a roadblock. I am new to this area, so any help is appreciated.

--- Application logs

2017-04-11 07:13:28 INFO  Client:58 - Submitting application 1 to ResourceManager
2017-04-11 07:13:28 INFO  YarnClientImpl:174 - Submitted application application_1491909036583_0001 to ResourceManager at /0.0.0.0:8032
2017-04-11 07:13:29 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:29 INFO  Client:58 - 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1491909208425
     final status: UNDEFINED
     tracking URL: http://ip-xxx.xx.xx.xxx:8088/proxy/application_1491909036583_0001/
     user: xxxx
2017-04-11 07:13:30 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:31 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:32 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:17:37 INFO  Client:58 - Application report for application_1491909036583_0001 (state: FAILED)
2017-04-11 07:17:37 INFO  Client:58 - 
     client token: N/A
     diagnostics: Application application_1491909036583_0001 failed 2 times due to AM Container for appattempt_1491909036583_0001_000002 exited with  exitCode: 10
For more detailed output, check application tracking page:http://"hostname":8088/cluster/app/application_1491909036583_0001Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1491909036583_0001_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

--- Container logs

2017-04-11 07:13:30 INFO  ApplicationMaster:47 - Registered signal handlers for [TERM, HUP, INT]
2017-04-11 07:13:31 INFO  ApplicationMaster:59 - ApplicationAttemptId: appattempt_1491909036583_0001_000001
2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing view acls to: root,xxxx
2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing modify acls to: root,xxxx
2017-04-11 07:13:32 INFO  SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, xxxx); users with modify permissions: Set(root, xxxx)
2017-04-11 07:13:32 INFO  Slf4jLogger:80 - Slf4jLogger started
2017-04-11 07:13:32 INFO  Remoting:74 - Starting remoting
2017-04-11 07:13:32 INFO  Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO  Remoting:74 - Remoting now listens on addresses: [akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO  Utils:59 - Successfully started service 'sparkYarnAM' on port 45446.
2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Waiting for Spark driver to be reachable.
2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Driver now available: xxx.xx.xx.xxx:47503
2017-04-11 07:15:32 ERROR ApplicationMaster:96 - Uncaught exception: 
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runAMEndpoint(ApplicationMaster.scala:279)
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:473)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:315)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:157)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:625)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:646)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
    ... 16 more
2017-04-11 07:15:32 INFO  ApplicationMaster:59 - Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout)
2017-04-11 07:15:32 INFO  ShutdownHookManager:59 - Shutdown hook called

--- YARN NodeManager logs at the time of failure

2017-04-11 07:15:18,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:21,735 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:24,742 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:27,749 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:30,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:33,018 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1491909036583_0001_01_000001 is : 10
2017-04-11 07:15:33,019 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1491909036583_0001_01_000001 and exit code: 10
ExitCodeException exitCode=10: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)

--- SparkContext parameters

<!-- Spark Configuration -->
<bean id="sparkInfo" class="SparkInfo">
    <property name="appName" value="framework"></property>
    <property name="master" value="yarn-client"></property>
    <property name="dynamicAllocation" value="false"></property>
    <property name="executorInstances" value="2"></property>
    <property name="executorMemory" value="1g"></property>
    <property name="executorCores" value="4"></property>
    <property name="executorCoresMax" value="2"></property>
    <property name="taskCpus" value="4"></property>
    <property name="executorClassPath" value="/usr/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/*"></property>
    <property name="yarnJar"
        value="${framework.hdfsURI}/app/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar"></property>
    <property name="yarnQueue" value="default"></property>
    <property name="memoryFraction" value="0.4"></property>
</bean>

spark-defaults.conf

spark.driver.memory              1g
spark.executor.extraJavaOptions   -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.rpc.lookupTimeout          600s

yarn-site.xml

<!-- Site specific YARN configuration properties -->
  <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3096</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>
</configuration>

Answer 1:

You can keep increasing spark.network.timeout until the problem stops appearing, as himanshuIIITian mentioned in a comment.
Timeout exceptions can occur when Spark is under heavy load. If executor memory is low, GC may keep the system very busy, which adds to that load. Check the logs for an OutOfMemoryError. Also enable -XX:+PrintGCDetails -XX:+PrintGCTimeStamps in spark.executor.extraJavaOptions and check whether a full GC runs several times before a task completes. If it does, increase executorMemory; that should hopefully solve your problem.
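
As a rough sketch of the settings this answer describes, the entries below could go in spark-defaults.conf. The property names are standard Spark settings; the values (300s, 2g) are illustrative assumptions rather than tuned recommendations. spark.network.timeout defaults to 120s and serves as the fallback for spark.rpc.lookupTimeout when that is not set explicitly.

```properties
# Illustrative values only (assumptions); adjust for your cluster.
# Raise the general network timeout from its 120s default.
spark.network.timeout              300s
# More executor heap, in case frequent full GCs are the cause.
spark.executor.memory              2g
# GC logging flags named in the answer; output goes to the executor logs.
spark.executor.extraJavaOptions    -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```

The property holds a single value, so if spark.executor.extraJavaOptions is already set (as in the question), append the GC flags to the existing value instead of adding a second line.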

Answer 2:

For me it was the firewall settings on the Spark cluster, which prevented the executors from connecting correctly. I could not figure this out quickly because the Spark UI showed all workers connected to the master, while other connections were being blocked by my firewall. After setting the following ports to fixed values and allowing them in the firewall, the problem was solved. (Note that Spark uses random ports for these settings by default.)

spark.driver.port                    
spark.blockManager.port
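
A minimal sketch of what pinning these ports might look like in spark-defaults.conf. The port numbers are hypothetical; any free ports that the firewall permits would do.

```properties
# Hypothetical port numbers (assumptions); choose ports your firewall allows.
spark.driver.port          40000
spark.blockManager.port    40010
# If a port is busy, Spark tries the next ones, up to this many retries (default 16).
spark.port.maxRetries      16
```

Because of that retry behavior, it may be necessary to open a small range starting at each configured port rather than a single port.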

Source: https://stackoverflow.com/questions/43346855/org-apache-spark-rpc-rpctimeoutexception-futures-timed-out-after-120-seconds

Tags

apache-spark

apache-spark-sql

yarn

hadoop2