When I try to submit a Spark Streaming job to a Google Dataproc cluster, I get this exception:
16/12/13 00:44:20 ERROR org.apache.spark.SparkContext: Error ini
Is it possible that this wasn't the first time you ran the job with the given checkpoint directory, as in the checkpoint directory already contains a checkpoint?
This happens because the checkpoint hard-codes the exact jarfile arguments used to submit the YARN application. When you run on Dataproc with a --jars flag pointing to GCS, that is actually syntactic sugar: Dataproc automatically stages your jarfile from GCS into a local file path such as /tmp/0afbad25-cb65-49f1-87b8-9cf6523512dd/skyfall-assembly-0.0.1.jar, which is only used for the duration of a single job run, since Spark can't invoke the jarfile directly out of GCS without staging it locally.
In a subsequent job, that temporary jarfile has already been deleted, but the new job still tries to refer to the old location hard-coded into the checkpoint data.
There are also additional issues caused by hard-coding in the checkpoint data; for example, Dataproc uses YARN "tags" to track jobs, and it will conflict with YARN if an old Dataproc job's "tag" is reused in a new YARN application. To run your streaming application, first clear out your checkpoint directory, if possible, to start from a clean slate. Then place the job jarfile somewhere on the master node and point your --jars flag at it with a "file:///" path.
When you specify a "file:///" path, Dataproc knows the file is already on the master node, so it doesn't re-stage it into a /tmp directory; in that case it's safe for the checkpoint to point to that fixed local path on the master.
You can do this either with an init action or by submitting a quick Pig job (or just SSH into the master and download the jarfile):
# Use a quick pig job to download the jarfile to a local directory (for example /usr/lib/spark in this case)
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "fs -cp gs://my-bucket/resources/skyfall-assembly-0.0.1.jar file:///usr/lib/spark/skyfall-assembly-0.0.1.jar"
# Submit the first attempt of the job
gcloud dataproc jobs submit spark --cluster my-test-cluster \
--class com.company.skyfall.Skyfall \
--jars file:///usr/lib/spark/skyfall-assembly-0.0.1.jar \
--properties spark.ui.showConsoleProgress=false
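The init-action route mentioned above could look roughly like the following. This is only a sketch: the function name stage_jar_if_master is made up for illustration, the bucket and destination paths are simply reused from the commands above, and it assumes the usual Dataproc convention of reading the node role from the dataproc-role metadata attribute.

```shell
#!/bin/bash
# Sketch of an init action that copies the job jar onto the master node.
# Paths are reused from the example commands above; adjust to your setup.
JAR_SRC="gs://my-bucket/resources/skyfall-assembly-0.0.1.jar"
JAR_DST="/usr/lib/spark/skyfall-assembly-0.0.1.jar"

# Hypothetical helper: copy the jar only when the given role is "Master",
# so worker nodes skip the download.
stage_jar_if_master() {
  local role="$1"
  if [ "$role" = "Master" ]; then
    gsutil cp "$JAR_SRC" "$JAR_DST"
  fi
}

# In a real init action, finish the script with something like:
#   stage_jar_if_master "$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
```

The script would be uploaded to GCS and passed at cluster creation time via --initialization-actions, so the jar is already in place before the first job is submitted.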
For now, Dataproc only "cleans up" suspicious YARN applications while the recently killed job ID is still held in memory, so restarting the Dataproc agent will fix this.
# Kill the job through the UI or something before the next step.
# Now use "pig sh" to restart the dataproc agent
gcloud dataproc jobs submit pig --cluster my-test-cluster \
--execute "sh systemctl restart google-dataproc-agent.service"
# Re-run your job without needing to change anything else,
# it'll be fine now if you ever need to resubmit it and it
# needs to recover from the checkpoint again.
Keep in mind, though, that by the nature of checkpoints you won't be able to change the arguments you pass on subsequent runs, because checkpoint recovery clobbers your command-line settings.
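If you do need to change the arguments, the clean-slate approach mentioned earlier still applies: delete the old checkpoint before resubmitting. A minimal sketch, assuming the checkpoint lives in GCS; the CHECKPOINT_DIR path and the clear_checkpoint helper are hypothetical, and for an HDFS checkpoint something like "hadoop fs -rm -r" on the master would be the equivalent.

```shell
# Hypothetical checkpoint location; substitute whatever directory your job
# actually passes as its Spark Streaming checkpoint directory.
CHECKPOINT_DIR="gs://my-bucket/checkpoints/skyfall"

# Recursively delete the old checkpoint so the next run starts from scratch
# instead of recovering the stale jar path, YARN tags and arguments.
clear_checkpoint() {
  gsutil -m rm -r "$1"
}

# Usage (this discards all recovery state for the streaming job):
#   clear_checkpoint "$CHECKPOINT_DIR"
```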
You can also run the job in YARN cluster mode to avoid having to add the jar to your master machine. The potential trade-off is that the Spark driver will run on a worker node instead of the master.
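A cluster-mode submission might look like the sketch below. It assumes the deploy mode is switched through the spark.submit.deployMode property and reuses the cluster, class and jar names from the commands above; the wrapper function submit_cluster_mode is only there for illustration. I haven't verified how this interacts with recovery from an existing checkpoint, so start from a cleared checkpoint directory when trying it.

```shell
# Sketch: submit the same job with the driver running inside YARN
# (cluster mode), so the jar can stay in GCS instead of on the master.
submit_cluster_mode() {
  gcloud dataproc jobs submit spark --cluster my-test-cluster \
    --class com.company.skyfall.Skyfall \
    --jars gs://my-bucket/resources/skyfall-assembly-0.0.1.jar \
    --properties spark.submit.deployMode=cluster,spark.ui.showConsoleProgress=false
}

# Usage:
#   submit_cluster_mode
```

Since the driver no longer runs on the master, its output ends up in the YARN container logs rather than in the usual job driver output.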