How do I perform a “diff” on two Sources given a key using Apache Beam Python SDK?

萝らか妹 submitted on 2020-01-15 07:27:12

Question


I posed the question generically, because maybe it is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, modified, with respect to a composite key, e.g. the first 2 columns.

Table A
C1  C2  C3
-----------
a   a   1
a   b   1
a   c   1

Table B     
C1  C2  C3  # Notes if comparing B to A
-------------------------------------
a   a   1   # No Change to the key a + a
a   b   2   # Key a + b Changed from 1 to 2
            # Deleted key a + c with value 1
a   d   1   # Added key a + d

I basically want to be able to produce and report the comparison notes. Or, from a Beam perspective, I may want to just output up to 4 labeled PCollections: Unchanged, Changed, Added, Deleted. How do I do this, and what would the PCollections look like?


Answer 1:


What you want to do here, basically, is join two tables and compare the result of that, right? You can look at my answer to this question, to see the two ways in which you can join two tables (Side inputs, or CoGroupByKey).
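
To make the CoGroupByKey route concrete, here is a minimal plain-Python sketch of the shape a co-group by key produces for the two example tables. It does not use Beam: `cogroup` is a hypothetical helper written only for illustration, and it assumes rows are tuples keyed by their first two columns.

```python
from collections import defaultdict

def cogroup(tagged_tables, key_fn):
    """Mimic the shape of a co-group: key -> {tag: [rows with that key]}."""
    grouped = defaultdict(lambda: {tag: [] for tag in tagged_tables})
    for tag, rows in tagged_tables.items():
        for row in rows:
            grouped[key_fn(row)][tag].append(row)
    return dict(grouped)

# The example tables from the question, as (C1, C2, C3) tuples.
table_a = [("a", "a", 1), ("a", "b", 1), ("a", "c", 1)]
table_b = [("a", "a", 1), ("a", "b", 2), ("a", "d", 1)]

joined = cogroup({"table_a": table_a, "table_b": table_b},
                 key_fn=lambda row: (row[0], row[1]))

# One entry per composite key; a side with no matching row is an empty list.
for key, values in sorted(joined.items()):
    print(key, values)
```

Every key that appears in either table gets exactly one entry, so a key missing from one side shows up with an empty list for that side. That is what makes the later comparison step possible.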

I'll also code a solution for your problem using CoGroupByKey. I'm writing the code in Python because I'm more familiar with the Python SDK, but you'd implement similar logic in Java:

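import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import WriteToText

def make_kv_pair(x):
  """Return (key, record), where the key is the composite (x[0], x[1])."""
  # Assumes x is indexable by position; if rows arrive as dicts,
  # build the key from the column names instead.
  return ((x[0], x[1]), x)

# p is the beam.Pipeline object.
table_a = (p | 'ReadTableA' >> beam.io.Read(beam.io.BigQuerySource(....))
            | 'SetKeysA' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadTableB' >> beam.io.Read(beam.io.BigQuerySource(....))
            | 'SetKeysB' >> beam.Map(make_kv_pair))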

joined_tables = ({'table_a': table_a, 'table_b': table_b}
                 | beam.CoGroupByKey())


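# Tag names must match the ones yielded below and read from the result.
output_types = ['changed', 'added', 'removed', 'unchanged']

class FilterDoFn(beam.DoFn):
  """Send each key to the tagged output describing how it differs from A to B."""
  def process(self, element):
    key, values = element
    table_a_value = list(values['table_a'])
    table_b_value = list(values['table_b'])
    if table_a_value == table_b_value:
      yield pvalue.TaggedOutput('unchanged', key)
    elif len(table_a_value) < len(table_b_value):
      # B has more rows for this key than A.
      yield pvalue.TaggedOutput('added', key)
    elif len(table_a_value) > len(table_b_value):
      # A has more rows for this key than B.
      yield pvalue.TaggedOutput('removed', key)
    else:
      # Same number of rows on both sides, but the contents differ.
      yield pvalue.TaggedOutput('changed', key)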

key_collections = (joined_tables 
                   | beam.ParDo(FilterDoFn()).with_outputs(*output_types))

# Now you can handle each output
key_collections.unchanged | WriteToText(...)
key_collections.changed | WriteToText(...)
key_collections.added | WriteToText(...)
key_collections.removed | WriteToText(...)
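
As for what the four outputs would contain, the following plain-Python sketch applies the same kind of comparison to the question's example data. It is not Beam code: `classify` is a hypothetical helper, and it uses emptiness checks rather than length comparisons, which assumes each composite key appears at most once per table.

```python
def classify(rows_a, rows_b):
    """Label one key given its rows from table A and from table B."""
    if rows_a == rows_b:
        return "unchanged"
    if not rows_a:
        return "added"      # key exists only in table B
    if not rows_b:
        return "removed"    # key exists only in table A
    return "changed"        # key exists in both, rows differ

# Co-grouped rows for the question's example, keyed by (C1, C2).
joined = {
    ("a", "a"): ([("a", "a", 1)], [("a", "a", 1)]),
    ("a", "b"): ([("a", "b", 1)], [("a", "b", 2)]),
    ("a", "c"): ([("a", "c", 1)], []),
    ("a", "d"): ([], [("a", "d", 1)]),
}

labels = {key: classify(a, b) for key, (a, b) in joined.items()}
for key, label in sorted(labels.items()):
    print(key, label)
# ('a', 'a') unchanged
# ('a', 'b') changed
# ('a', 'c') removed
# ('a', 'd') added
```

In the pipeline above, each tagged output is a PCollection of composite keys only, for example `('a', 'b')` in the changed output. To report the old and new values as well, one option is to yield the key together with both row lists instead of the key alone.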


Source: https://stackoverflow.com/questions/45873830/how-do-i-perform-a-diff-on-two-sources-given-a-key-using-apache-beam-python-sd
