I am using Dataproc to run Spark commands over a cluster using spark-shell. I frequently get error/warning messages indicating that I lose connection with my executors. The messages look like this:
[Stage 6:> (0 + 2) / 2]16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 17, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.2 in stage 6.0 (TID 16, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
...
Here is another sample:
16/01/20 10:51:43 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on spark-cluster-femibyte-w-1.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:51:43 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-1.c.gcebook-1039.internal:58745] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 5, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 2 idle
Is this normal? Is there anything I can do to prevent this?
If the job itself isn't failing, and you're not seeing other propagated errors tied to actual task failures (at least as far as I can tell from what's posted in the question), you're most likely just seeing a harmless but notoriously spammy issue in core Spark: dynamic allocation relinquishes under-used executors in the middle of a job and re-allocates them as needed. Spark originally failed to suppress the executor-lost part of those messages, but we've tested to make sure it has no ill effects on the actual job.
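If you want to keep dynamic allocation but reduce how often executors get released and re-acquired, the standard Spark dynamic-allocation properties can be tuned at launch. A minimal sketch, with illustrative values only (these property names are standard Spark settings, not Dataproc-specific recommendations):

# Keep dynamic allocation on, but hold a floor of executors and wait
# longer before releasing idle ones (values here are illustrative).
spark-shell \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.executorIdleTimeout=600s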
Here's a Google Groups thread highlighting some of the behavioral details of Spark on YARN.
To check whether it's indeed dynamic allocation causing the messages, try running:
spark-shell --conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=99999
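If the noisy messages disappear with dynamic allocation turned off, that confirms the executors were being released intentionally rather than crashing. As a rough sanity check (assuming you're SSH'd into the cluster's master node), you can also watch the per-node container counts, which should stay flat for the life of the job when dynamic allocation is off:

# List YARN applications to find the Spark application id, then check
# how many containers are running on each worker node.
yarn application -list
yarn node -list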
Or if you're submitting jobs through gcloud beta dataproc jobs, then:
gcloud beta dataproc jobs submit spark \
--properties spark.dynamicAllocation.enabled=false,spark.executor.instances=99999
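For a complete invocation, the --properties flag sits alongside the usual job arguments. A sketch, where the cluster name, class, and jar path are placeholders I've assumed for illustration (not taken from the original question):

# Placeholders: replace my-cluster and the class/jar with your own job.
gcloud beta dataproc jobs submit spark \
    --cluster my-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/lib/spark-examples.jar \
    --properties spark.dynamicAllocation.enabled=false,spark.executor.instances=99999 \
    -- 1000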
If you're genuinely seeing network hiccups or other Dataproc errors disassociating the master and workers, and it's not an application-side OOM or the like, you can email the Dataproc team directly at dataproc-feedback@google.com; being in beta is no excuse for latent broken behavior (though of course we hope to weed out tricky edge-case bugs that we may not have discovered yet during the beta period).
Source: https://stackoverflow.com/questions/34897150/google-dataproc-disconnect-with-executors-often