I\'m trying to run a Spark ML pipeline (load some data from JDBC, run some transformers, train a model) on my Yarn cluster but each time I run it, a couple - sometimes one,
TLDR: Make sure your code is threadsafe and race condition-free before you blame Spark.
Figured it out. For posterity: was using an thread-unsafe data structure (a mutable HashMap). Since executors on the same machine share a JVM, this was resulting in data races that were locking up the separate threads/tasks.
The upshot: when you have spark.executor.cores > 1
(and you probably should), make sure your code is threadsafe.