This is a follow-up to "Spark streaming on dataproc throws FileNotFoundException".
Over the past few weeks (not sure since exactly when), restart of a spark streaming j
We've recently added auto-restart capabilities to Dataproc jobs (available in the gcloud beta track and in the v1 API).
To take advantage of auto-restart, a job must be able to recover and clean up after itself, so most jobs will not work with it unmodified. However, it does work out of the box for Spark Streaming jobs that use checkpoint files.
The restart-dataproc-agent trick should no longer be necessary. Auto-restart is resilient against job crashes, Dataproc agent failures, and VM restart-on-migration events.
Example:
gcloud beta dataproc jobs submit spark ... --max-failures-per-hour 1
See: https://cloud.google.com/dataproc/docs/concepts/restartable-jobs
If you want to test recovery, you can simulate a VM migration by resetting the master VM [1]. After that, you should be able to describe the job [2] and see an ATTEMPT_FAILURE entry in statusHistory.
[1] gcloud compute instances reset
[2] gcloud dataproc jobs describe
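
If you'd rather check for the ATTEMPT_FAILURE entry programmatically than eyeball the describe output, something along these lines may help. It is only a sketch: it assumes you have obtained the job as JSON (for example by adding --format=json to the describe command) and that each statusHistory entry carries a "state" field. The function name and the sample job below are hypothetical, hand-written for illustration, not real output.

```python
import json


def attempt_failures(job: dict) -> list:
    """Return the statusHistory entries whose state is ATTEMPT_FAILURE."""
    history = job.get("statusHistory", [])
    return [entry for entry in history if entry.get("state") == "ATTEMPT_FAILURE"]


# Hypothetical sample, trimmed to the fields this sketch reads.
sample_job = json.loads("""
{
  "status": {"state": "RUNNING"},
  "statusHistory": [
    {"state": "PENDING"},
    {"state": "RUNNING"},
    {"state": "ATTEMPT_FAILURE"}
  ]
}
""")

failures = attempt_failures(sample_job)
print(f"restart attempts recorded: {len(failures)}")
```

A job that has never been restarted simply yields an empty list, so the same check works before and after you reset the VM.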