Why are increments not supported in the Dataflow-BigTable connector?

Submitted by 心已入冬 on 2020-05-13 08:12:11

Question


We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (e.g., the number of items that have finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once processing, why would supporting it be a bad idea, since I would know for sure the increment was applied only once? I want to understand what part I am missing.
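
Concretely, the operation in question is HBase's server-side read-modify-write increment, which against a plain HBase Table looks roughly like this (a minimal sketch; the table, row, family, and qualifier names are hypothetical):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
  // Bump a counter cell by 1; HBase performs the read-modify-write
  // atomically on the server. All names below are hypothetical.
  static void recordItemFinished(Connection connection) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("pipeline-stats"))) {
      table.incrementColumnValue(
          Bytes.toBytes("items-finished"), // row key
          Bytes.toBytes("stats"),          // column family
          Bytes.toBytes("count"),          // qualifier
          1L);                             // amount to add
    }
  }
}
```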

Also, is CloudBigTableIO usable in streaming mode, or is it tied to batch mode only? I guess we could use the BigTable HBase client directly in the pipeline, but the connector seems to have nice properties like connection pooling that we would like to leverage, hence the question.


Answer 1:


The way that Dataflow (and other systems) offer the appearance of exactly-once execution in the presence of failures and retries is by requiring that side effects (such as mutating BigTable) be idempotent. A "write" is idempotent because it overwrites the same cell on retry. Inserts can be made idempotent by including a deterministic "insert ID" that deduplicates the insert.

For an increment, that is not the case: retrying it applies the delta again, so it is not idempotent and would not preserve exactly-once execution. That is why it is not supported.
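
To make the distinction concrete, here is a sketch against a plain HBase Table (row, family, and qualifier names are hypothetical): replaying a Put converges to the same cell value, while replaying an increment drifts.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IdempotencySketch {
  static void replayBothOps(Table table) throws Exception {
    // Idempotent: delivering this Put twice still leaves the cell at 42.
    Put put = new Put(Bytes.toBytes("row-1"));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(42L));
    table.put(put);
    table.put(put); // retry: no visible change

    // Not idempotent: delivering this increment twice applies +1 twice.
    table.incrementColumnValue(
        Bytes.toBytes("row-1"), Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
    table.incrementColumnValue( // retry: the counter is now off by one
        Bytes.toBytes("row-1"), Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
  }
}
```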




Answer 2:


CloudBigTableIO is usable in streaming mode. We had to implement a DoFn rather than a Sink in order to support that via the Dataflow SDK.
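
For example, a streaming write through the connector can look roughly like this. This is a minimal sketch, assuming the com.google.cloud.bigtable.beam flavor of the connector and a Pub/Sub source; the project, instance, table, topic, and column names are placeholders:

```java
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingBigtableWrite {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Placeholder project/instance/table IDs.
    CloudBigtableTableConfiguration config =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("my-table")
            .build();

    Pipeline p = Pipeline.create(options);
    p.apply("ReadMessages",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
        .apply("ToMutations", ParDo.of(new DoFn<String, Mutation>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Emit an idempotent Put per message: a retry just
            // overwrites the same cell with the same value.
            Put put = new Put(Bytes.toBytes(c.element()));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
                Bytes.toBytes(c.element()));
            c.output(put);
          }
        }))
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(config));
    p.run();
  }
}
```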



Source: https://stackoverflow.com/questions/43854923/why-increments-are-not-supported-in-dataflow-bigtable-connector
