Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner

不知归路 2021-01-20 11:24

We're generating a sequential index in a ParDo using Beam's Java SDK 2.0.0. Just like the simple stateful index example in Beam's introduction to stateful processing we

1 answer
  •  一生所求
    2021-01-20 11:48

    This is not only the expected behavior of the Dataflow runner, but a logical necessity in any context. It doesn't matter whether you are using state in Beam or an AtomicInteger in a single-process Java program: if operation "A" writes a value and operation "B" reads that value, then "B" must be executed after "A". The common term for this relationship is "happens-before".
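To make the AtomicInteger case concrete, here is a minimal sketch in plain Java (not Beam code). The class and method names are hypothetical, chosen only for illustration. Each call to `getAndIncrement()` reads the value left by the previous call, so the index assignments form a chain in which every step happens after the one before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not taken from the question or from Beam.
final class SequentialIndexSketch {

    // Tags each element with an index drawn from one shared counter.
    static List<String> indexAll(List<String> elements) {
        AtomicInteger next = new AtomicInteger(0);
        List<String> out = new ArrayList<>();
        for (String element : elements) {
            // Reads the value written by the previous iteration, then writes
            // the value the next iteration will read.
            int index = next.getAndIncrement();
            out.add(index + ":" + element);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [0:a, 1:b, 2:c]
        System.out.println(indexAll(Arrays.asList("a", "b", "c")));
    }
}
```

Spreading the loop body across several threads would not remove the chain: the atomic counter would still serialize the increments, one after another.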

    This form of stateful computation is the opposite of parallel computation. By definition, a read that observes a write has a causal relationship. By definition, two operations that are in parallel do not have a causal relationship.

    Now, you are perhaps expecting parallel threads that access the state cell concurrently, as in the standard pattern of multi-threaded programming with some shared state with concurrency control. For this example, if these threads were actually parallel, you would get duplicate indices. Taking a step back, Beam targets massive "embarrassingly parallel" computations transparently distributed across a large cluster of machines. Fine-grained concurrency controls, aside from being extremely difficult to get right, do not readily translate to massive distributed computations.
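The duplicate-index outcome can be sketched deterministically in plain Java by writing out one interleaving that becomes possible when two operations have no ordering between them. The `Cell` class below is a hypothetical stand-in for a single state cell, not Beam's state API, and the interleaving is simulated on one thread so that the result is reproducible.

```java
// Hypothetical illustration; Cell stands in for one state cell.
final class DuplicateIndexSketch {

    static final class Cell {
        int value;
    }

    // Neither operation waits for the other: both read before either writes.
    static int[] unorderedInterleaving() {
        Cell cell = new Cell();
        int readByA = cell.value;   // A reads 0
        int readByB = cell.value;   // B reads 0 as well
        cell.value = readByA + 1;   // A writes 1
        cell.value = readByB + 1;   // B overwrites with 1
        return new int[] {readByA, readByB};
    }

    // B runs only after A has finished, so B observes A's write.
    static int[] orderedExecution() {
        Cell cell = new Cell();
        int indexA = cell.value;    // A reads 0
        cell.value = indexA + 1;    // A writes 1
        int indexB = cell.value;    // B reads 1
        cell.value = indexB + 1;    // B writes 2
        return new int[] {indexA, indexB};
    }

    public static void main(String[] args) {
        int[] unordered = unorderedInterleaving();
        int[] ordered = orderedExecution();
        System.out.println(unordered[0] + "," + unordered[1]); // prints 0,0
        System.out.println(ordered[0] + "," + ordered[1]);     // prints 0,1
    }
}
```

Only the ordered version produces distinct indices, and it is exactly the version in which the two operations are not parallel.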
