Question
I've created a dataflow template with some parameters. When I write the data to BigQuery, I would like to make use of these parameters to determine which table it is supposed to write to. I've tried calling WriteToBigQuery in a ParDo as suggested in the following link.
How can I write to Big Query using a runtime value provider in Apache Beam?
The pipeline runs successfully, but it is not creating the table or loading any data into BigQuery. Any idea what might be the issue?
def run():
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    with beam.Pipeline(options=pipeline_options) as p:
        custom_options = pipeline_options.view_as(CustomOptions)

        _ = (
            p
            | beam.Create([None])
            | 'Year to periods' >> beam.ParDo(SplitYearToPeriod(custom_options.year))
            | 'Read plan data' >> beam.ParDo(GetPlanDataByPeriod(custom_options.secret_name))
            | 'Transform record' >> beam.Map(transform_record)
            | 'Write to BQ' >> beam.ParDo(WritePlanDataToBigQuery(custom_options.year))
        )


if __name__ == '__main__':
    run()
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--year', type=int)
        parser.add_value_provider_argument('--secret_name', type=str)
class WritePlanDataToBigQuery(beam.DoFn):
    def __init__(self, year_vp):
        self._year_vp = year_vp

    def process(self, element):
        year = self._year_vp.get()

        table = f's4c.plan_data_{year}'
        schema = {
            'fields': [ ...some fields properties ]
        }

        beam.io.WriteToBigQuery(
            table=table,
            schema=schema,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS
        )
Answer 1:
You have instantiated the PTransform beam.io.gcp.bigquery.WriteToBigQuery inside the process method of your DoFn. There are a couple of problems here:
- The process method is called for each element of the input PCollection. It is not used for building the pipeline graph, so this approach to dynamically constructing the graph will not work.
- Once you move it out of the DoFn, you need to apply the PTransform beam.io.gcp.bigquery.WriteToBigQuery to a PCollection for it to have any effect. See the Beam pydoc or the Beam tutorial documentation.
To create a derived value provider for your table name, you would need a "nested" value provider. Unfortunately this is not supported for the Python SDK. You can use the value provider option directly, though.
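For illustration only, here is a rough sketch of how the write could look once it is moved out of the DoFn and applied to a PCollection. It assumes the question's CustomOptions gains a hypothetical '--table' value provider argument, and that the value provider is handed to WriteToBigQuery directly instead of a string built from it; SplitYearToPeriod, GetPlanDataByPeriod and transform_record are the question's own code.

import apache_beam as beam
from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions


def run():
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    with beam.Pipeline(options=pipeline_options) as p:
        custom_options = pipeline_options.view_as(CustomOptions)

        # Build the PCollection as in the question.
        records = (
            p
            | beam.Create([None])
            | 'Year to periods' >> beam.ParDo(SplitYearToPeriod(custom_options.year))
            | 'Read plan data' >> beam.ParDo(GetPlanDataByPeriod(custom_options.secret_name))
            | 'Transform record' >> beam.Map(transform_record)
        )

        # Apply WriteToBigQuery to the PCollection at graph-construction time,
        # passing the value provider itself rather than calling .get() on it.
        _ = records | 'Write to BQ' >> beam.io.WriteToBigQuery(
            table=custom_options.table,  # hypothetical '--table' value provider
            schema={'fields': [{'name': 'period', 'type': 'STRING'}]},  # placeholder; use the real schema
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS
        )

The key difference from the question's code is that the write is now part of the pipeline graph, so the runner actually executes it; the DoFn-based WritePlanDataToBigQuery is no longer needed.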
As an advanced option, you may be interested in trying out "flex templates", which essentially package up your whole program as a Docker image and execute it with parameters.
Answer 2:
If the objective is for the code to accept parameters instead of a hard-coded string for the table path, here is a way to achieve that:
- Add the table parameters to CustomOptions
- Inside your run function, read the CustomOptions parameters back as default strings with a separate argument parser
...
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--gcs_input_file_path',
            type=str,
            help='GCS Input File Path'
        )
        parser.add_value_provider_argument(
            '--project_id',
            type=str,
            help='GCP ProjectID'
        )
        parser.add_value_provider_argument(
            '--dataset',
            type=str,
            help='BigQuery DataSet Name'
        )
        parser.add_value_provider_argument(
            '--table',
            type=str,
            help='BigQuery Table Name'
        )
def run(argv=None):
    pipeline_option = PipelineOptions()
    pipeline = beam.Pipeline(options=pipeline_option)
    custom_options = pipeline_option.view_as(CustomOptions)
    pipeline_option.view_as(SetupOptions).save_main_session = True
    pipeline_option.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--gcp_project_id',
        type=str,
        help='GCP ProjectID',
        default=str(custom_options.project_id)
    )
    parser.add_argument(
        '--dataset',
        type=str,
        help='BigQuery DataSet Name',
        default=str(custom_options.dataset)
    )
    parser.add_argument(
        '--table',
        type=str,
        help='BigQuery Table Name',
        default=str(custom_options.table)
    )
    static_options, _ = parser.parse_known_args(argv)

    path = static_options.gcp_project_id + ":" + static_options.dataset + "." + static_options.table

    data = (
        pipeline
        | "Read from GCS Bucket" >> beam.io.textio.ReadFromText(custom_options.gcs_input_file_path)
        | "Parse Text File" >> beam.ParDo(Split())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            path,
            schema=Schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )

    result = pipeline.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
- Pass the table path at pipeline construction time in the shell file
python template.py \
--dataset dataset_name \
--table table_name \
--project project_name \
--runner DataFlowRunner \
--region region_name \
--staging_location gs://bucket_name/staging \
--temp_location gs://bucket_name/temp \
--template_location gs://bucket_name/templates/template_name
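Once the template has been staged, a job can be launched from it with gcloud, supplying the remaining runtime value provider (gcs_input_file_path) at execution time. A rough example, with the job name and file path as placeholders:

gcloud dataflow jobs run plan_data_load_job \
    --gcs-location gs://bucket_name/templates/template_name \
    --region region_name \
    --parameters gcs_input_file_path=gs://bucket_name/input/input_file.csv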
Source: https://stackoverflow.com/questions/62053420/calling-beam-io-writetobigquery-in-a-beam-dofn