Question
I'm building a Dataflow pipeline that reads from Pub/Sub and sends requests to a 3rd party API. The pipeline uses THROUGHPUT_BASED autoscaling.
However, when I ran a load test against it, after it autoscaled to 4 workers to catch up with the backlog in Pub/Sub, the same workload seemed to be spread out evenly between the workers, but overall throughput did not increase significantly.
[Image] Number of unacknowledged messages in Pub/Sub. The peak is when traffic stopped coming in.
[Image] Bytes sent from each worker. The peak is the initial worker. As more workers were added to the pool, the workload was offloaded onto them instead of each worker picking up more work. CPU utilization looks similar across workers, with peak utilization below 30% on the initial worker.
[Image] The history of workers spawned.
It feels like some limitation is being hit somewhere, but I have a hard time seeing what that limitation is. I was pulling fewer than 300 messages per second, and each message is about 1 KB.
Update: I did another round of comparison between a batch job using TextIO and a streaming job using PubsubIO, both with "n1-standard-8" machines and the number of workers fixed at 15. The batch job went up to 450 elements/s, but the streaming job still peaked at 230 elements/s. It seems the limitation is coming from the source, although I'm not sure what that limitation is.
Update 2: Here is a simple code snippet to reproduce the issue. You will need to manually set the number of workers to 1 and then 5 and compare the number of elements processed by the pipeline. You will need a load tester to efficiently publish messages to the topic.
package debug;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
public class DebugPipeline {

    @SuppressWarnings("serial")
    public static void main(String[] args) {

        /*******************************************
         * SETUP - Build options.
         ********************************************/
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setAutoscalingAlgorithm(
                DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
        // Autoscaling will scale between n/15 and n workers, so from 1-15 here
        options.setMaxNumWorkers(15);
        // Default of 250GB is absurdly high and we don't need that much on every worker
        options.setDiskSizeGb(32);
        // Manually configure scaling (i.e. 1 vs 5 for comparison)
        options.setNumWorkers(5);

        // Debug pipeline
        Pipeline pipeline = Pipeline.create(options);
        pipeline
                .apply(PubsubIO.readStrings()
                        .fromSubscription("your subscription"))
                // This is the transform that I actually care about. In production code, this
                // sends a REST request to some 3rd party endpoint.
                .apply("sleep", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws InterruptedException {
                        Thread.sleep(500);
                        c.output(c.element());
                    }
                }));

        pipeline.run();
    }
}
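Roughly, the batch variant used for the comparison in the first update was the same pipeline with TextIO as the source. A minimal sketch, assuming the test messages were staged as text files in a GCS bucket (the path is a placeholder, and the same sleep transform stands in for the 3rd party call):

package debug;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class DebugBatchPipeline {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        // Same options plumbing as above, minus the streaming-specific settings.
        Pipeline pipeline = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());
        pipeline
                // Placeholder path: wherever the test messages were staged as text files.
                .apply(TextIO.read().from("gs://your-bucket/messages-*.txt"))
                // Same simulated 3rd party call as the streaming snippet.
                .apply("sleep", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws InterruptedException {
                        Thread.sleep(500);
                        c.output(c.element());
                    }
                }));
        pipeline.run();
    }
}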
Answer 1:
Taking into account that:
- Switching from PubSubIO to TextIO showed no improvement.
- Changing from 3 to 15 workers showed no improvement.
- The batch job went up to 450 elements/s but the streaming job peaked at 230 elements/s.
- There is a transform that sends a REST request to a 3rd party API, taking hours of wall time.
- In a test, removing that transform increased throughput from 120 elements/s to 400 elements/s.
The issue doesn't seem to lie on the Pub/Sub side. According to this documentation, you might be overloading the 3rd party API. The same effect is explained in the documentation for clients, rather than for 3rd party APIs:
It's possible that one client could have a backlog of messages because it doesn't have the capacity to process the volume of incoming messages, but another client on the network does have that capacity. The second client could reduce the overall backlog, but it doesn't get the chance to because the first client cannot send its messages to the second client quickly enough. This reduces the overall rate of processing because messages get stuck on the first client.
The messages that create a backlog consume memory, CPU, and bandwidth resources because the client library continues to extend the messages' acknowledgment deadline.
[...]
More generally, the need for flow control indicates that messages are being published at a higher rate than they are being consumed. If this is a persistent state, rather than a spike in message volume, consider increasing the number of subscriber client instances and machines.
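As a side note, this is roughly what that flow control looks like with the standalone google-cloud-pubsub Java client. The Beam/Dataflow PubsubIO connector manages this internally, so the sketch below is only to illustrate the concept; the project and subscription names and the limits are placeholder values:

import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class FlowControlledSubscriber {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("your-project", "your-subscription");
        // Process each message (e.g. call the 3rd party API), then ack it.
        MessageReceiver receiver = (message, consumer) -> {
            // ... do the work here ...
            consumer.ack();
        };
        // Cap how many messages the client holds in memory at once.
        FlowControlSettings flowControl = FlowControlSettings.newBuilder()
                .setMaxOutstandingElementCount(1_000L)
                .setMaxOutstandingRequestBytes(100L * 1024L * 1024L)
                .build();
        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                .setFlowControlSettings(flowControl)
                .build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated();
    }
}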
If you can only work on the Pub/Sub side to improve the results, and you think the way to achieve this is extending the acknowledgement deadline for elements, you can test it by accessing here and manually editing the subscription. To do it programmatically in Java, have a look at this and this documentation, about managing subscriptions and changing ackDeadlineSeconds respectively.
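For illustration, a minimal sketch of that programmatic change with the google-cloud-pubsub admin client (the project and subscription names, and the 60-second value, are placeholders):

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class ExtendAckDeadline {
    public static void main(String[] args) throws Exception {
        // Placeholder project and subscription names.
        ProjectSubscriptionName subscriptionName =
                ProjectSubscriptionName.of("your-project", "your-subscription");
        try (SubscriptionAdminClient adminClient = SubscriptionAdminClient.create()) {
            // Only the fields named in the update mask are changed.
            Subscription subscription = Subscription.newBuilder()
                    .setName(subscriptionName.toString())
                    .setAckDeadlineSeconds(60) // e.g. raise from the default 10s to 60s
                    .build();
            UpdateSubscriptionRequest request = UpdateSubscriptionRequest.newBuilder()
                    .setSubscription(subscription)
                    .setUpdateMask(FieldMask.newBuilder()
                            .addPaths("ack_deadline_seconds")
                            .build())
                    .build();
            Subscription updated = adminClient.updateSubscription(request);
            System.out.println("ackDeadlineSeconds is now " + updated.getAckDeadlineSeconds());
        }
    }
}

The same change can also be made from the command line with gcloud pubsub subscriptions update your-subscription --ack-deadline=60.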
Source: https://stackoverflow.com/questions/51507709/dataflow-autoscale-does-not-boost-performance