EMR cluster creation using Airflow DAG run; once the task is done, the EMR cluster will be terminated

Submitted by 左心房为你撑大大i on 2019-12-29 09:53:26

Question


I have Airflow jobs that run fine on an EMR cluster. What I need is this: say I have 4 Airflow jobs that need an EMR cluster for about 20 minutes to complete their work. Why can't we create an EMR cluster at DAG run time and, once the jobs finish, terminate the cluster that was created?


Answer 1:


Absolutely, that would be the most efficient use of resources. Let me warn you: there are a lot of details in this; I'll try to list as many as will get you going. I encourage you to add your own comprehensive answer listing any problems you encountered and their workarounds (once you are through this).


Regarding cluster creation / termination

  • For cluster creation and termination, you have EmrCreateJobFlowOperator and EmrTerminateJobFlowOperator respectively (see the sketch after this list)

  • Don't fret if you do not use an AWS SecretAccessKey (and rely wholly on IAM Roles); instantiating any AWS-related hook or operator in Airflow will automatically fall back to the underlying EC2 instance's attached IAM Role

  • If you're NOT using the EMR-Steps API for job submission, then you'll also have to manually sense both of the above operations using Sensors. There's already a sensor for polling the creation phase, called EmrJobFlowSensor, and you can modify it slightly to create a sensor for termination too

  • You pass your cluster-config JSON in job_flow_overrides. You can also pass configs in a Connection's extra param (like my_emr_conn), but refrain from that because it often breaks SQLAlchemy ORM loading (since it's a big JSON)
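
A minimal sketch of the create / terminate wiring (untested, assuming Airflow 1.10's contrib import paths; in Airflow 2.x the same classes live under airflow.providers.amazon.aws). The cluster config, DAG id and task ids below are illustrative placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
    from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

    # Illustrative cluster config (adjust release label, instance types, roles)
    JOB_FLOW_OVERRIDES = {
        "Name": "airflow-transient-cluster",
        "ReleaseLabel": "emr-5.29.0",
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "Market": "ON_DEMAND", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "Market": "ON_DEMAND", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # keep the cluster alive after steps finish; we terminate it explicitly
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    with DAG("transient_emr", start_date=datetime(2019, 1, 1),
             schedule_interval=None, catchup=False) as dag:

        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
            aws_conn_id="aws_default",   # falls back to the EC2-attached IAM Role if no keys are set
            emr_conn_id="emr_default",
        )

        # EmrCreateJobFlowOperator pushes the cluster id (job_flow_id) to XCom
        terminate_cluster = EmrTerminateJobFlowOperator(
            task_id="terminate_cluster",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
            aws_conn_id="aws_default",
            trigger_rule="all_done",     # tear down even if an upstream job fails
        )

        # your job-submission tasks go between these two
        create_cluster >> terminate_cluster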


Regarding job submission

  • You can submit jobs to EMR using the EMR-Steps API, either during the cluster-creation phase (within the cluster-config JSON) or afterwards using add_job_flow_steps(). There's even an emr_add_steps_operator() in Airflow, which also requires an EmrStepSensor (see the first sketch after this list). You can read more about it in the AWS docs, and you might also have to use command-runner.jar

  • For application-specific cases (like Hive or Livy), you can use their specific ways. For instance, you can use HiveServer2Hook to submit a Hive job. Here's the tricky part: the run_job_flow() call (made during the cluster-creation phase) only gives you a job_flow_id (cluster-id). You'll have to use a describe_cluster() call via EmrHook to obtain the private IP of the master node (see the second sketch after this list). Using this, you will then be able to programmatically create a Connection (such as a Hive Server 2 Thrift connection) and use it for submitting your computations to the cluster. And don't forget to delete those connections (for elegance) before completing your workflow.

  • Finally there's good old bash for interacting with the cluster. For this you should also pass an EC2 key pair during the cluster-creation phase. Afterwards, you can programmatically create an SSH connection and use it (with an SSHHook or SSHOperator) for running jobs on your cluster. Read more about SSH stuff in Airflow here

  • Particularly for submitting Spark jobs to a remote EMR cluster, read this discussion
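
For the EMR-Steps route, a hedged sketch continuing the DAG above (the step definition and the S3 script path are placeholders):

    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    # Illustrative step: spark-submit via command-runner.jar (script path is a placeholder)
    SPARK_STEPS = [
        {
            "Name": "my_spark_job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/my_job.py"],
            },
        }
    ]

    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )

    # EmrAddStepsOperator pushes the list of step ids to XCom; watch the first one
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )

    create_cluster >> add_steps >> watch_step >> terminate_cluster

And for looking up the master node at runtime (e.g. to build a HiveServer2 or SSH connection), a sketch using EmrHook's underlying boto3 client; note it uses boto3's list_instances() rather than describe_cluster() so it can read the private IP directly:

    from airflow.contrib.hooks.emr_hook import EmrHook
    from airflow.operators.python_operator import PythonOperator

    def get_master_private_ip(**context):
        """Look up the private IP of the EMR master node for the just-created cluster."""
        job_flow_id = context["ti"].xcom_pull(task_ids="create_cluster", key="return_value")
        emr_client = EmrHook(aws_conn_id="aws_default").get_conn()   # boto3 EMR client
        instances = emr_client.list_instances(
            ClusterId=job_flow_id, InstanceGroupTypes=["MASTER"]
        )["Instances"]
        return instances[0]["PrivateIpAddress"]   # use this to create the Connection

    fetch_master_ip = PythonOperator(
        task_id="fetch_master_ip",
        python_callable=get_master_private_ip,
        provide_context=True,    # needed in Airflow 1.10; drop in Airflow 2.x
    )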





Answer 2:


The best way to do this is probably to have a node at the root of your Airflow DAG that creates the EMR cluster, and then another node at the very end of the DAG that spins the cluster down after all of the other nodes have completed.
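
In Airflow terms that wiring could look like the following (a sketch; the task names are placeholders, and the teardown task should be created with trigger_rule="all_done" so the cluster is removed even when one of the job tasks fails):

    # create the cluster first, run all jobs against it, then always tear it down
    create_cluster >> [job_1, job_2, job_3, job_4] >> terminate_cluster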



Source: https://stackoverflow.com/questions/55227683/emr-cluster-creation-using-airflow-dag-run-once-task-is-done-emr-will-be-termin
