Question
According to the release notes for Dataflow 2.x, IntraBundleParallelization has been removed. Is there a way to control or increase the parallelism of DoFns on Dataflow 2.1.0?
I was getting better performance when I used IntraBundleParallelization on Dataflow 1.9.0.
Answer 1:
It was removed because its implementation keeps a handle on the ProcessContext of a ProcessElement call after the call completes, and this is unsafe and not guaranteed to work.
However, I agree that it was a useful abstraction, and it is unfortunate that we don't have a replacement yet.
As a workaround, you can try the following:
- In your DoFn's @Setup, create an Executor with the needed number of threads.
- In your DoFn's @StartBundle, create an ExecutorCompletionService wrapping the executor.
- In @ProcessElement, submit a Future to it representing the result of processing the element.
- In @ProcessElement, also poll() the CompletionService for completed futures and output their results.
- In @FinishBundle, wait for all remaining futures to complete, output their results, and shut down the CompletionService.
Remember not to use the ProcessContext in your futures. ProcessContext can only be used from the current thread and from within the current ProcessElement call.
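The steps above can be sketched in plain Java, using only java.util.concurrent so that it runs without the Beam SDK. This is a rough illustration, not a tested Dataflow implementation: the class ParallelBundleWorker, its method names, and the emit callback are all hypothetical, and each method would be called from the DoFn lifecycle method named in its comment. One deliberate difference from the steps above: ExecutorCompletionService has no shutdown method, and the thread pool is created once in setup, so this sketch shuts the pool down in a teardown step instead of at the end of each bundle.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper, not part of the Beam or Dataflow SDK. Each method notes
// the DoFn lifecycle method that would call it.
class ParallelBundleWorker<InT, OutT> {
    private final int threads;
    private final Function<InT, OutT> work; // must not touch ProcessContext
    private ExecutorService executor;
    private ExecutorCompletionService<OutT> completion;
    private int pending; // futures submitted but not yet emitted

    ParallelBundleWorker(int threads, Function<InT, OutT> work) {
        this.threads = threads;
        this.work = work;
    }

    // From @Setup: one thread pool per DoFn instance.
    void setup() {
        executor = Executors.newFixedThreadPool(threads);
    }

    // From @StartBundle: a fresh completion service per bundle.
    void startBundle() {
        completion = new ExecutorCompletionService<>(executor);
        pending = 0;
    }

    // From @ProcessElement: the caller reads the element out of the context
    // first; emit runs only on the calling thread.
    void processElement(InT element, Consumer<OutT> emit) {
        completion.submit(() -> work.apply(element));
        pending++;
        Future<OutT> done;
        while ((done = completion.poll()) != null) { // non-blocking
            emitResult(done, emit);
        }
    }

    // From @FinishBundle: block until every remaining future has been emitted.
    void finishBundle(Consumer<OutT> emit) {
        while (pending > 0) {
            try {
                emitResult(completion.take(), emit);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
    }

    // Assumption: from @Teardown, because the pool is created once in setup.
    void teardown() {
        executor.shutdown();
    }

    // Unwraps a completed future and hands its value to the caller's callback.
    private void emitResult(Future<OutT> future, Consumer<OutT> emit) {
        pending--;
        try {
            emit.accept(future.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}
```

Results are emitted in completion order, not input order, so anything downstream should not rely on ordering within a bundle. The worker threads only ever see the element value and the work function; all output goes through the emit callback on the thread that called processElement or finishBundle, which is what keeps the ProcessContext out of the futures.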
Source: https://stackoverflow.com/questions/47023871/is-there-an-alternative-to-intrabundleparallelization-in-dataflow-2-1-0