Submitting Spark jobs from Airflow via Livy's batch POST API and tracking them


Question


I want to use Airflow to orchestrate jobs that include running Pig scripts, shell scripts, and Spark jobs.

For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit directly.

And once a job is submitted, what is the best way to track it from Airflow?


Answer 1:


My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the alternatives (a minimal submission sketch follows the list below):

  • Specifying remote master IP: Requires modifying global configurations / environment variables
  • Using SSHOperator: SSH connection might break
  • Using EmrAddStepsOperator: Dependent on EMR
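
For illustration, here is a minimal sketch of what the batch submission could look like, assuming a Python task using the requests library; the Livy host, JAR path, main class, and arguments are placeholders, not values from the original answer:

```python
# A minimal sketch of a remote spark-submit via Livy's POST /batches endpoint.
# The host URL, JAR path, main class, and args are hypothetical placeholders.
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy server address

payload = {
    "file": "s3://my-bucket/jars/my-spark-app.jar",  # placeholder application JAR
    "className": "com.example.MySparkApp",           # placeholder main class
    "args": ["--input", "s3://my-bucket/data/"],
    "conf": {"spark.executor.memory": "2g"},
}

resp = requests.post(f"{LIVY_URL}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]  # keep this id for later state/log polling
print(f"Submitted Livy batch {batch_id}")
```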

Regarding tracking

  • Livy only reports state (e.g. running, dead, success), not progress (% completion of stages)
  • If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the WebUI (View Logs). See the polling sketch after this list.
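
A minimal polling sketch, again with a placeholder Livy host and batch id, might look like this:

```python
# A minimal sketch of tracking a Livy batch from an Airflow Python task:
# poll GET /batches/{id}/state until a terminal state, and page through
# GET /batches/{id}/log, printing lines to stdout so they show up in the
# Airflow task log. The Livy host below is a hypothetical placeholder.
import time
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy server address
TERMINAL_STATES = {"success", "dead", "killed"}

def track_livy_batch(batch_id: int, poll_interval: int = 30) -> str:
    """Poll the batch until it finishes; echo its logs into the task log."""
    offset = 0
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
        log_page = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/log",
            params={"from": offset, "size": 100},
        ).json()
        lines = log_page.get("log", [])
        for line in lines:
            print(line)  # surfaces in the Airflow WebUI under View Logs
        offset += len(lines)
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)

# Raise so Airflow marks the task failed when the Spark job did not succeed.
if track_livy_batch(batch_id=42) != "success":  # 42 is a placeholder batch id
    raise RuntimeError("Livy batch did not finish successfully")
```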

Other considerations

  • Livy doesn't support reusing a SparkSession across POST /batches requests
  • If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests (a session-based sketch follows this list)
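
A minimal sketch of the session-based flow, assuming a placeholder Livy host, could look like this:

```python
# A minimal sketch of Livy's interactive session API, which keeps one
# SparkSession alive across requests: POST /sessions creates a PySpark
# session, and each POST /sessions/{id}/statements executes a code snippet
# in that same session. The host URL is a hypothetical placeholder.
import time
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy server address

# Create an interactive PySpark session.
session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait for the session to become idle before submitting statements.
while requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"] != "idle":
    time.sleep(5)

# This snippet (and any later statements) runs against the same SparkSession.
code = "df = spark.range(100)\nprint(df.count())"
stmt = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements", json={"code": code}
).json()
print(f"Submitted statement {stmt['id']} to session {session_id}")
```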

References

  • How to submit Spark jobs to EMR cluster from Airflow?
  • livy/examples/pi_app
  • rssanders3/livy_spark_operator_python_example
  • Remote spark-submit to YARN running on EMR


Source: https://stackoverflow.com/questions/54228651/spark-job-submission-using-airflow-by-submitting-batch-post-method-on-livy-and-t
