Calling beam.io.WriteToBigQuery in a beam.DoFn


Question


I've created a Dataflow template with some parameters. When writing the data to BigQuery, I would like to use these parameters to determine which table it should write to. I've tried calling WriteToBigQuery in a ParDo as suggested in the following link.

How can I write to Big Query using a runtime value provider in Apache Beam?

The pipeline ran successfully, but it is not creating the table or loading any data into BigQuery. Any idea what the issue might be?

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition
from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions


def run():
  pipeline_options = PipelineOptions()
  pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

  with beam.Pipeline(options=pipeline_options) as p:
    custom_options = pipeline_options.view_as(CustomOptions)

    _ = (
      p
      | beam.Create([None])
      | 'Year to periods' >> beam.ParDo(SplitYearToPeriod(custom_options.year))
      | 'Read plan data' >> beam.ParDo(GetPlanDataByPeriod(custom_options.secret_name))
      | 'Transform record' >> beam.Map(transform_record)
      | 'Write to BQ' >> beam.ParDo(WritePlanDataToBigQuery(custom_options.year))
    )

if __name__ == '__main__':
  run()


class CustomOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_value_provider_argument('--year', type=int)
    parser.add_value_provider_argument('--secret_name', type=str)


class WritePlanDataToBigQuery(beam.DoFn):
  def __init__(self, year_vp):
    self._year_vp = year_vp

  def process(self, element):
    year = self._year_vp.get()

    table = f's4c.plan_data_{year}'
    schema = {
      'fields': [ ...some fields properties ]
    }

    beam.io.WriteToBigQuery(
      table=table,
      schema=schema,
      create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
      method=beam.io.WriteToBigQuery.Method.FILE_LOADS
    )

Answer 1:


You have instantiated the PTransform beam.io.gcp.bigquery.WriteToBigQuery inside the process method of your DoFn. There are a couple of problems here:

  • The process method is called for each element of the input PCollection. It is not used for building the pipeline graph. This approach to dynamically constructing the graph will not work.
  • Once you move it out of the DoFn, you need to apply the PTransform beam.io.gcp.bigquery.WriteToBigQuery to a PCollection for it to have any effect. See the Beam pydoc or the Beam tutorial documentation.

To create a derived value provider for your table name, you would need a "nested" value provider. Unfortunately, this is not supported by the Python SDK. You can, however, pass a value provider option directly, as sketched below.
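
For illustration, here is a minimal sketch of what that could look like. It assumes a hypothetical --table_spec template parameter carrying the full table name (instead of deriving it from --year), a placeholder schema, and that the new BigQuery sink accepts a ValueProvider for the table argument; adapt the names to your pipeline.

import apache_beam as beam
from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions


class CustomOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    # Hypothetical parameter: the complete table spec is a template
    # parameter of its own instead of being derived from --year.
    parser.add_value_provider_argument('--table_spec', type=str)


def run():
  pipeline_options = PipelineOptions()
  pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']
  custom_options = pipeline_options.view_as(CustomOptions)

  # Placeholder schema for the sketch.
  schema = {'fields': [{'name': 'id', 'type': 'INTEGER', 'mode': 'REQUIRED'}]}

  with beam.Pipeline(options=pipeline_options) as p:
    _ = (
        p
        | 'Create rows' >> beam.Create([{'id': 1}, {'id': 2}])
        # The sink is applied to the PCollection while the graph is built;
        # the ValueProvider is only resolved when the template actually runs.
        | 'Write to BQ' >> beam.io.WriteToBigQuery(
            table=custom_options.table_spec,
            schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )


if __name__ == '__main__':
  run()

At job-submission time, --table_spec would then be supplied like any other template parameter.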

As an advanced option, you may be interested in trying out "flex templates", which essentially package your whole program up as a Docker image and execute it with parameters.
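
As a rough illustration (not part of the original answer, with placeholder job, bucket, and parameter names), launching a staged flex template could look like this; because the pipeline code runs at launch time, parameters such as year no longer need to be value providers and can be formatted straight into the table name:

gcloud dataflow flex-template run plan-data-job \
  --template-file-gcs-location gs://bucket_name/templates/plan_data_spec.json \
  --region region_name \
  --parameters year=2021,secret_name=my_secret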




Answer 2:


If the objective is for the code to accept parameters instead of a hard-coded string for the table path, here is a way to achieve that:

  • Add the table parameters as CustomOptions value-provider arguments
  • Inside your run function, read the CustomOptions parameters and use them as the default strings of ordinary argparse arguments
...

import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import (DebugOptions, PipelineOptions,
                                                  SetupOptions)


class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--gcs_input_file_path',
            type=str,
            help='GCS Input File Path'
        )
        parser.add_value_provider_argument(
            '--project_id',
            type=str,
            help='GCP ProjectID'
        )
        parser.add_value_provider_argument(
            '--dataset',
            type=str,
            help='BigQuery DataSet Name'
        )
        parser.add_value_provider_argument(
            '--table',
            type=str,
            help='BigQuery Table Name'
        )

def run(argv=None):

    pipeline_option = PipelineOptions()
    pipeline = beam.Pipeline(options=pipeline_option)
    custom_options = pipeline_option.view_as(CustomOptions)
    pipeline_option.view_as(SetupOptions).save_main_session = True
    pipeline_option.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--gcp_project_id',
        type=str,
        help='GCP ProjectID',
        default=str(custom_options.project_id)
    )
    parser.add_argument(
        '--dataset',
        type=str,
        help='BigQuery DataSet Name',
        default=str(custom_options.dataset)
    )
    parser.add_argument(
        '--table',
        type=str,
        help='BigQuery Table Name',
        default=str(custom_options.table)
    )

    static_options, _ = parser.parse_known_args(argv)
    path = static_options.gcp_project_id + ":" + static_options.dataset + "." + static_options.table

    data = (
            pipeline
            | "Read from GCS Bucket" >>
            beam.io.textio.ReadFromText(custom_options.gcs_input_file_path)
            | "Parse Text File" >>
            beam.ParDo(Split())
            | 'WriteToBigQuery' >>
            beam.io.WriteToBigQuery(
                path,
                schema=Schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
    )

    result = pipeline.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
  • Pass the table path parameters at pipeline (template) construction time from the shell:
python template.py \
  --dataset dataset_name \
  --table table_name \
  --project project_name \
  --runner DataflowRunner \
  --region region_name \
  --staging_location gs://bucket_name/staging \
  --temp_location gs://bucket_name/temp \
  --template_location gs://bucket_name/templates/template_name
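
Once the template has been staged, the remaining runtime value provider (gcs_input_file_path) can be supplied when launching a job from it, for example (placeholder job, bucket, and file names):

gcloud dataflow jobs run plan-data-job \
  --gcs-location gs://bucket_name/templates/template_name \
  --region region_name \
  --parameters gcs_input_file_path=gs://bucket_name/input/data.csv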


Source: https://stackoverflow.com/questions/62053420/calling-beam-io-writetobigquery-in-a-beam-dofn
