Workflow scheduling on GCP Dataproc cluster

落爺英雄遲暮 submitted on 2020-02-24 03:56:08

Question


I have some complex Oozie workflows to migrate from on-prem Hadoop to GCP Dataproc. The workflows consist of shell scripts, Python scripts, Spark-Scala jobs, Sqoop jobs, etc.

I have come across some potential solutions that could cover my workflow scheduling needs:

  1. Cloud Composer
  2. Dataproc Workflow Template with Cloud Scheduler
  3. Install Oozie on Dataproc auto-scaling cluster

Please let me know which option would be the most efficient in terms of performance, cost, and migration complexity.


Answer 1:


All 3 are reasonable options (though #2, Scheduler + Dataproc, is the clunkiest). A few questions to consider: how often do your workflows run, how tolerant are you of idle VMs, how complex are your Oozie workflows, and how willing are you to invest time in the migration?

Dataproc's Workflow Templates support branch/join but lack other Oozie features, such as job-failure handling and decision nodes. If you use any of these, I would not even consider a direct migration to Workflow Templates; choose either #3 or the hybrid migration below.
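To make the branch/join point concrete, here is a minimal sketch of how an Oozie-style fork and join might be expressed as a job DAG. It assumes that Workflow Template jobs declare ordering through `step_id` and `prerequisite_step_ids` fields, which is my understanding of the template format; the step names are hypothetical, the job-type fields a real template needs (for example a Spark or Pig job definition) are left out, and `execution_waves` is a local helper for illustration, not part of any Dataproc API.

```python
def fork_join_jobs():
    """Return job entries for one start step, two parallel branches, and a join.

    Only the ordering fields are shown; a real template entry would also
    carry a job definition. All step names are hypothetical.
    """
    return [
        {"step_id": "extract", "prerequisite_step_ids": []},
        {"step_id": "branch_a", "prerequisite_step_ids": ["extract"]},
        {"step_id": "branch_b", "prerequisite_step_ids": ["extract"]},
        {"step_id": "join", "prerequisite_step_ids": ["branch_a", "branch_b"]},
    ]


def execution_waves(jobs):
    """Group step ids into waves whose prerequisites have all completed."""
    remaining = {j["step_id"]: set(j["prerequisite_step_ids"]) for j in jobs}
    done, waves = set(), []
    while remaining:
        # A step is ready once every one of its prerequisites is done.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown prerequisite in workflow")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


if __name__ == "__main__":
    for wave in execution_waves(fork_join_jobs()):
        print(wave)
```

Note what the structure cannot say: there is no place to put "on failure, go to step X" or "choose the next step based on a condition", which is the gap relative to Oozie described above.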

A good place to start would be a hybrid migration (assuming your clusters are sparsely used). Keep your Oozie workflows, and have Composer + Workflow Templates create a cluster with Oozie installed. Use an init action to stage your Oozie XML files and job jars/artifacts, then add a single pig sh job to the Workflow Template to trigger Oozie via its CLI.
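The hybrid setup could be sketched as a template definition along these lines. This is only an illustration under stated assumptions: the dictionary mirrors the Workflow Template field names as I understand them (`placement.managed_cluster`, `initialization_actions`, `pig_job.query_list`), while the bucket, the two init script names, the template id, the cluster name, and the `job.properties` location are all hypothetical. It also assumes Oozie listens on its default local URL, `http://localhost:11000/oozie`.

```python
def build_oozie_template(bucket, workflow_dir, cluster_name="oozie-ephemeral"):
    """Build a Workflow Template-style dict for the hybrid migration.

    bucket, workflow_dir, cluster_name, the template id and the init script
    paths are hypothetical placeholders, not values from the original answer.
    """
    # Pig's `sh` command runs a shell command, here the Oozie CLI.
    oozie_cmd = (
        "sh oozie job -oozie http://localhost:11000/oozie "
        f"-config {workflow_dir}/job.properties -run"
    )
    return {
        "id": "oozie-hybrid",
        "placement": {
            "managed_cluster": {
                "cluster_name": cluster_name,
                "config": {
                    # Init actions: install Oozie, then stage XML files and jars.
                    "initialization_actions": [
                        {"executable_file": f"gs://{bucket}/init/install-oozie.sh"},
                        {"executable_file": f"gs://{bucket}/init/stage-workflows.sh"},
                    ]
                },
            }
        },
        "jobs": [
            {
                "step_id": "trigger-oozie",
                "pig_job": {"query_list": {"queries": [oozie_cmd]}},
            }
        ],
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_oozie_template("my-bucket", "/opt/workflows/daily"), indent=2))
```

One design point to check before relying on this: `oozie job ... -run` submits the workflow and returns, and a managed cluster is torn down once the template's jobs finish, so the trigger step would probably need to wait for the Oozie workflow to complete (for example by polling its status) rather than exit right after submission.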



Source: https://stackoverflow.com/questions/59142107/workflow-scheduling-on-gcp-dataproc-cluster
