I'm doing a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from PubSub and write to BigQuery, but I can't handle exceptions on the pipeline to create alte…
You can also use the generator flavor of FlatMap:

This is similar to the other answer, in that you can use a DoFn in the place of something else, e.g. a CombineFn, to produce no outputs when there is an exception or other kind of failed precondition.
import logging
from typing import Generator, List

def sum_values(values: List[int]) -> Generator[int, None, None]:
    if not values or len(values) < 10:
        logging.error(f'received invalid inputs: {...}')
        return
    yield sum(values)
# Now, use this instead of CombinePerKey:
(inputs
 | 'WithKey' >> beam.Map(lambda x: (x.key, x))
 | 'GroupByKey' >> beam.GroupByKey()
 | 'Values' >> beam.Values()
 | 'MaybeSum' >> beam.FlatMap(sum_values))
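As a quick sanity check, here is a minimal sketch using Beam's testing utilities (assuming the sum_values function defined above; the input lists are made up for illustration). Inputs shorter than ten elements are simply dropped instead of raising:

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

with TestPipeline() as p:
    results = (
        p
        | beam.Create([list(range(10)), [1, 2, 3]])  # second list is too short
        | 'MaybeSum' >> beam.FlatMap(sum_values))
    # Only the valid ten-element list produces a sum; the invalid
    # one is logged and yields nothing.
    assert_that(results, equal_to([45]))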
I've only been able to catch exceptions at the DoFn level, so something like this:
import apache_beam as beam
from apache_beam import pvalue

class MyPipelineStep(beam.DoFn):
    def process(self, element, *args, **kwargs):
        try:
            # do stuff that produces output_element...
            yield pvalue.TaggedOutput('main_output', output_element)
        except Exception as e:
            yield pvalue.TaggedOutput('exception', str(e))
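You then split the two streams with with_outputs. A minimal sketch, assuming inputs is your upstream PCollection and that the dead-letter Pub/Sub topic is a placeholder you would replace:

results = (
    inputs
    | 'MyStep' >> beam.ParDo(MyPipelineStep()).with_outputs(
        'main_output', 'exception'))

# results['main_output'] holds the successfully processed elements;
# results['exception'] holds the stringified errors.
(results['exception']
 | 'EncodeError' >> beam.Map(lambda s: s.encode('utf-8'))  # WriteToPubSub expects bytes
 | 'DeadLetter' >> beam.io.WriteToPubSub(
       topic='projects/<your-project>/topics/<dead-letter-topic>'))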
However, WriteToBigQuery is a PTransform that wraps the DoFn BigQueryWriteFn, so you may need to do something like this:
import logging

from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery

class MyBigQueryWriteFn(BigQueryWriteFn):
    def process(self, *args, **kwargs):
        try:
            # super() must reference the subclass here, not BigQueryWriteFn,
            # or the lookup would skip BigQueryWriteFn's own process().
            return super(MyBigQueryWriteFn, self).process(*args, **kwargs)
        except Exception as e:
            # Do something here, e.g. log the failure.
            logging.error('BigQuery write failed: %s', e)
class MyWriteToBigQuery(WriteToBigQuery):
    # Copy the source code of `WriteToBigQuery` here,
    # but replace `BigQueryWriteFn` with `MyBigQueryWriteFn`.
    pass
https://beam.apache.org/releases/pydoc/2.9.0/_modules/apache_beam/io/gcp/bigquery.html#WriteToBigQuery
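Instead of just swallowing the error, the except branch could re-emit the failing element on a tagged output, mirroring the DoFn pattern above. This is only a hedged sketch: the 'write_errors' tag is an assumption, not part of the BigQueryWriteFn API, and the copied MyWriteToBigQuery would still have to apply this DoFn with .with_outputs(...) so the error collection is reachable from the pipeline.

from apache_beam import pvalue
from apache_beam.io.gcp.bigquery import BigQueryWriteFn

class MyBigQueryWriteFn(BigQueryWriteFn):
    def process(self, element, *args, **kwargs):
        try:
            # BigQueryWriteFn batches rows, so process() may return
            # nothing until a batch is flushed.
            results = super(MyBigQueryWriteFn, self).process(
                element, *args, **kwargs)
            if results:
                for result in results:
                    yield result
        except Exception as e:
            # Assumed tag name: route the failing element and the error
            # message to a side output feeding a dead-letter sink.
            yield pvalue.TaggedOutput('write_errors', (element, str(e)))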