Scio: groupByKey doesn't work when using Pub/Sub as collection source

人盡茶涼 提交于 2019-12-23 22:15:42

问题


I changed source of WindowsWordCount example program from text file to cloud Pub/Sub as shown below. I published shakespeare file's data to Pub/Sub which did get fetched properly but none of the transformations after .groupByKey seem to work.

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map {
    s =>
      println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
      println(s)
  }

回答1:


Changing the input from a text file to PubSub the PCollection "unbounded". Grouping that by key requires to define aggregation triggers, otherwise the grouper will wait forever. It's mentioned in the dataflow documentation here: https://cloud.google.com/dataflow/model/group-by-key

Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.

If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.

Unfortunately, in the Python SDK of Apache Beam seems not to support triggers (yet), so I'm not sure what the solution would be in python.

(see https://beam.apache.org/documentation/programming-guide/#triggers)




回答2:


With regards to Franz's comment above (I would reply to his comment specifically if StackOverflow would let me!,) I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.

See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py

Beware, the API is unfinished as this is not "release-ready" code. But it is available.



来源:https://stackoverflow.com/questions/44628370/scio-groupbykey-doesnt-work-when-using-pub-sub-as-collection-source

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!