Question
I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit.
Also, what is the best way to track a Spark job from Airflow once it has been submitted?
Answer 1:
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a minimal submission sketch follows the list below):
- Specifying the remote master IP: requires modifying global configurations / environment variables
- Using SSHOperator: the SSH connection might break
- Using EmrAddStepsOperator: dependent on EMR
Regarding tracking:
- Livy only reports state, not progress (% completion of stages)
- If you're OK with that, you can just poll the Livy server via its REST API and keep printing the logs to the console; those will appear in the task logs in the Airflow WebUI (View Logs)
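For example, an Airflow task could poll the batch state and stream the driver log into the task log roughly like this (a sketch, assuming the hypothetical submit_batch() above and Livy's /batches/{id}/state and /batches/{id}/log endpoints):

```python
import time
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

def track_batch(batch_id, poll_interval=30):
    """Poll Livy until the batch finishes, printing logs so they show up in the Airflow task log."""
    log_offset = 0
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]

        # Fetch any new driver log lines since the last poll
        log_resp = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/log",
            params={"from": log_offset, "size": 100},
        ).json()
        for line in log_resp.get("log", []):
            print(line)  # surfaces in the Airflow WebUI (View Logs)
        log_offset = log_resp.get("from", log_offset) + len(log_resp.get("log", []))

        if state in ("success", "dead", "killed"):
            print(f"Batch {batch_id} finished with state: {state}")
            if state != "success":
                raise RuntimeError(f"Livy batch {batch_id} failed ({state})")
            return
        time.sleep(poll_interval)
```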
Other considerations:
- Livy doesn't support reusing a SparkSession across POST /batches requests
- If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
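A rough sketch of that interactive-session route, assuming the standard Livy REST endpoints (POST /sessions to create a PySpark session, then POST /sessions/{id}/statements to run code snippets against the same SparkSession):

```python
import json
import time
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint
HEADERS = {"Content-Type": "application/json"}

def create_pyspark_session():
    """Create an interactive PySpark session whose SparkSession is reused across statements."""
    resp = requests.post(f"{LIVY_URL}/sessions",
                         data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
    resp.raise_for_status()
    session_id = resp.json()["id"]
    # Wait until the session is idle before submitting statements
    while requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"] != "idle":
        time.sleep(5)
    return session_id

def run_statement(session_id, code):
    """Run a PySpark snippet in the existing session and wait for its output."""
    resp = requests.post(f"{LIVY_URL}/sessions/{session_id}/statements",
                         data=json.dumps({"code": code}), headers=HEADERS)
    resp.raise_for_status()
    statement_id = resp.json()["id"]
    while True:
        stmt = requests.get(f"{LIVY_URL}/sessions/{session_id}/statements/{statement_id}").json()
        if stmt["state"] == "available":
            return stmt["output"]
        time.sleep(5)

# Both statements below would share the same SparkSession (exposed as `spark` by Livy):
# sid = create_pyspark_session()
# run_statement(sid, "df = spark.range(100); df.cache(); df.count()")
# run_statement(sid, "print(df.count())  # reuses the cached DataFrame")
```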
References
- How to submit Spark jobs to EMR cluster from Airflow?
- livy/examples/pi_app
- rssanders3/livy_spark_operator_python_example
Useful links
- Remote spark-submit to YARN running on EMR
Source: https://stackoverflow.com/questions/54228651/spark-job-submission-using-airflow-by-submitting-batch-post-method-on-livy-and-t