Google Cloud Composer (Airflow) - Dataflow job inside a DAG executes successfully, but the DAG fails

臣服心动 2021-01-17 22:36

My DAG looks like this:

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 0,
    'dataflow_default_options': {

2 Answers
  •  余生分开走
    2021-01-17 22:47

    The fix has been merged to master but has not been released yet, so here is a workaround for anyone who needs a Beam SDK version newer than 2.19.0.

    The idea is to apply the fix in a custom hook (identical to the hook in gcp_dataflow_hook.py, but with the suggested change applied) and then write a custom operator that uses this hook. Here is how I did it:

    First, I created a file named my_dataflow_hook.py:

    import re
    
    from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook, _Dataflow, _DataflowJob
    from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook
    
    
    class _myDataflow(_Dataflow):
        @staticmethod
        def _extract_job(line):
            job_id_pattern = re.compile(
                br".*console.cloud.google.com/dataflow.*/jobs/.*/([a-z|0-9|A-Z|\-|\_]+).*")
            # The pattern is a bytes pattern, so the fallback must be bytes too.
            matched_job = job_id_pattern.search(line or b'')
            if matched_job:
                return matched_job.group(1).decode()
    
    
    class MyDataFlowHook(DataFlowHook):
        @GoogleCloudBaseHook._Decorators.provide_gcp_credential_file
        def _start_dataflow(self, variables, name, command_prefix, label_formatter):
            variables = self._set_variables(variables)
            cmd = command_prefix + self._build_cmd(variables, label_formatter)
            job_id = _myDataflow(cmd).wait_for_done()
            _DataflowJob(self.get_conn(), variables['project'], name,
                         variables['region'],
                         self.poll_sleep, job_id,
                         self.num_retries).wait_for_done()
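
    The effect of the extra `.*/` in the pattern can be checked in isolation. Below is a minimal sketch, assuming a console URL of the form `.../dataflow/jobs/<region>/<job_id>?project=...`. The log line, project, and job ID are made up for illustration, and `PATTERN_WITHOUT_REGION` is only an assumption of what a pattern without the fix looks like, not a copy of the released hook.

```python
import re

# Pattern from the workaround above: skips one extra path segment after /jobs/.
PATTERN_WITH_REGION = re.compile(
    br".*console.cloud.google.com/dataflow.*/jobs/.*/([a-z|0-9|A-Z|\-|\_]+).*")

# Assumed for comparison only: the same pattern without the extra ".*/" segment.
PATTERN_WITHOUT_REGION = re.compile(
    br".*console.cloud.google.com/dataflow.*/jobs/([a-z|0-9|A-Z|\-|\_]+).*")

# Hypothetical log line; the URL layout, project, and job ID are assumptions.
SAMPLE_LINE = (
    b"INFO: To access the Dataflow monitoring console, please navigate to "
    b"https://console.cloud.google.com/dataflow/jobs/us-central1/"
    b"2021-01-17_12_00_00-1234567890123456789?project=my-project"
)


def extract_job(pattern, line):
    """Return the first captured group as text, or None when nothing matches."""
    matched = pattern.search(line or b"")
    return matched.group(1).decode() if matched else None


job_id_fixed = extract_job(PATTERN_WITH_REGION, SAMPLE_LINE)
job_id_unfixed = extract_job(PATTERN_WITHOUT_REGION, SAMPLE_LINE)
print(job_id_fixed)
print(job_id_unfixed)
```

    On this sample line, the pattern with the extra segment captures the job ID, while the pattern without it captures the region (`us-central1`). A hook that then polls for a job with that ID would not find it, which would be consistent with the Dataflow job succeeding while the DAG task fails.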
    

    Then, I created a file named my_dataflow_java_operator.py:

    import copy
    
    from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator, GoogleCloudBucketHelper
    from hooks.my_dataflow_hook import MyDataFlowHook
    from airflow.plugins_manager import AirflowPlugin
    
    
    class MyDataFlowJavaOperator(DataFlowJavaOperator):
        def execute(self, context):
            bucket_helper = GoogleCloudBucketHelper(
                self.gcp_conn_id, self.delegate_to)
            self.jar = bucket_helper.google_cloud_to_local(self.jar)
            hook = MyDataFlowHook(gcp_conn_id=self.gcp_conn_id,
                                delegate_to=self.delegate_to,
                                poll_sleep=self.poll_sleep)
    
            dataflow_options = copy.copy(self.dataflow_default_options)
            dataflow_options.update(self.options)
    
            hook.start_java_dataflow(self.job_name, dataflow_options,
                                     self.jar, self.job_class)
    
    class MyDataFlowPlugin(AirflowPlugin):
        """Expose Airflow operators."""
    
        name = 'dataflow_fix_plugin'
        operators = [MyDataFlowJavaOperator]
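
    The `execute` method above layers the task's `options` on top of `dataflow_default_options` before starting the job. Here is a minimal sketch of that merge on plain dictionaries. The option values are hypothetical, and `merge_dataflow_options` is an illustrative helper, not an Airflow function.

```python
import copy


def merge_dataflow_options(default_options, task_options):
    """Shallow-copy the defaults, then let task-level options override them."""
    merged = copy.copy(default_options)
    merged.update(task_options)
    return merged


# Hypothetical values, standing in for default_args['dataflow_default_options'].
defaults = {"project": "my-project", "region": "us-central1"}
# Hypothetical per-task options passed to the operator.
task_options = {"region": "europe-west1", "tempLocation": "gs://my-bucket/tmp"}

merged = merge_dataflow_options(defaults, task_options)
print(merged)
```

    Task-level keys win on conflict, and the defaults dictionary itself is left unchanged. Because the copy is shallow, any nested values would still be shared between the two dictionaries.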
    

    Finally, I uploaded these files into the bucket of the Composer environment following this structure:

    ├── dags
    │   └── my_dag.py
    └── plugins
        ├── hooks
        │   └── my_dataflow_hook.py
        └── my_dataflow_java_operator.py
    

    Now, I can create tasks with MyDataFlowJavaOperator in my DAGs:

    from airflow import DAG
    from airflow.operators.dataflow_fix_plugin import MyDataFlowJavaOperator
    ...
    with DAG("df-custom-test", default_args=default_args) as dag:
        test_task = MyDataFlowJavaOperator(dag=dag, task_id="df-java", jar=JAR_FILE, job_name=JOB_NAME)
    

    Of course, you can do the same with the DataFlowPythonOperator or the DataflowTemplateOperator if needed.
