Apache Beam + Dataflow too slow for only 18k data

Posted by 萝らか妹 on 2020-12-15 06:44:10

Question


We need to execute a heavy calculation on simple but numerous data.
Input data are rows in a BigQuery table with two columns: ID (INTEGER) and DATA (STRING). The DATA values are of the form "1#2#3#4#..." with 36 values.
Output data have the same form, but the DATA values are transformed by an algorithm.
It's a one-for-one transformation.

We have tried Apache Beam with Google Cloud Dataflow, but it does not work: errors occur as soon as several workers are instantiated.
For our POC we use only 18k input rows; the target is about 1 million.

Here is a light version of the class (I've removed the write part; the behaviour remains the same):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class MyClass {

// MyService and MyDto are our own classes (not shown here)
static MyService myService = new MyService();

// Maps each BigQuery row to a KV of (ID, DATA)
static class ExtractDataFn extends DoFn<TableRow, KV<Long, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Long id = Long.parseLong((String) c.element().get("ID"));  
        String data = (String) c.element().get("DATA");         
        c.output(KV.of(id, data));
    }
}

public interface Options extends PipelineOptions {
    String getInput();
    void setInput(String value);

    @Default.Enum("EXPORT")
    TypedRead.Method getReadMethod();
    void setReadMethod(TypedRead.Method value);

    @Validation.Required
    String getOutput();
    void setOutput(String value);
}

static void run(Options options) {
    Pipeline p = Pipeline.create(options);

    // Output schema, used by the BigQueryIO write step omitted below
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ID").setType("INTEGER"));
    fields.add(new TableFieldSchema().setName("DATA").setType("STRING"));
    TableSchema schema = new TableSchema().setFields(fields);

    PCollection<TableRow> rowsFromBigQuery = p.apply(
            BigQueryIO.readTableRows().from(options.getInput()).withMethod(options.getReadMethod())
    );              
    
    PCollection<KV<Long, String>> inputdata = rowsFromBigQuery.apply(ParDo.of(new ExtractDataFn()));
    PCollection<KV<Long, String>> outputData = applyTransform(inputdata);
    // Here goes the part where the data are written to a BQ table
    // (a sketch of that step is shown after the class)
    p.run().waitUntilFinish();
}

static PCollection<KV<Long, String>> applyTransform(PCollection<KV<Long, String>> inputData) {      
    PCollection<KV<Long, String>> forecasts = inputData.apply(ParDo.of(new DoFn<KV<Long, String>, KV<Long, String>> () {
                    
        @ProcessElement
        public void processElement(@Element KV<Long, String> element, OutputReceiver<KV<Long, String>> receiver) {
            MyDto dto = new MyDto();
            // Parse the "1#2#3#4#..." string into its numeric values
            List<Double> values = Arrays.stream(element.getValue().split("#")).map(Double::valueOf).collect(Collectors.toList());
            dto.setInputData(values);
            dto = myService.calculate(dto); // here is the time consuming operation
            String modifiedData = dto.getModifiedData().stream().map(Object::toString).collect(Collectors.joining(","));
            receiver.output(KV.of(element.getKey(), modifiedData));
        }
      }))
    ;
    return forecasts;
}

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
}

}
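
For context, the omitted write part presumably resembles the sketch below. This is an assumption, not the author's actual code: it reuses the schema built in run(), converts each KV back to a TableRow, and writes with BigQueryIO batch loads (the BatchLoads step that appears in the logs below):

    outputData
        .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // Rebuild a two-column row matching the schema from run()
                c.output(new TableRow()
                        .set("ID", c.element().getKey())
                        .set("DATA", c.element().getValue()));
            }
        }))
        .apply(BigQueryIO.writeTableRows()
                .to(options.getOutput())  // destination table from --output
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));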

In the GCP Logs console we can see the number of workers increase to 10 over about 5 minutes, then drop to 3 or 4; after that we get messages like the following (several hundred of them) while CPU sits at about 0%:

Proposing dynamic split of work unit myproject;2020-10-06_06_18_27-12689839210406435299;1231063355075246317 at {"fractionConsumed":0.5,"position":{"shufflePosition":"f_8A_wD_AAAB"}}

and

Operation ongoing in step BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read for at least 05m00s without outputting or completing in state read-shuffle at app//org.apache.beam.runners.dataflow.worker.ApplianceShuffleReader.readIncludingPosition(Native Method)

If we let it run, it eventually fails with this kind of error:

Error message from worker: java.lang.RuntimeException: unexpected org.apache.beam.runners.dataflow.worker.util.common.worker.CachingShuffleBatchReader.read(CachingShuffleBatchReader.java:77)

If I modify the myService.calculate method to be faster, all the data are processed by a single worker and there is no problem. The problem only seems to occur when the processing is parallelized across workers.

Thank you for your help


Answer 1:


The solution was to configure the firewall by adding a rule allowing communication between the Dataflow workers. This matches the symptoms: the write's BatchLoads stage performs a GroupByKey, which shuffles data between worker VMs, so inter-worker traffic is only needed once the job scales beyond a single worker.

https://cloud.google.com/dataflow/docs/guides/routes-firewall
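
For reference, the rule described in that guide allows traffic between worker VMs, which Dataflow marks with the network tag dataflow, on TCP ports 12345-12346 (used for shuffle). A sketch with placeholder rule and network names; adapt them to your project:

    gcloud compute firewall-rules create FIREWALL_RULE_NAME \
        --network=NETWORK \
        --action=allow \
        --direction=ingress \
        --target-tags=dataflow \
        --source-tags=dataflow \
        --priority=0 \
        --rules=tcp:12345-12346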



Source: https://stackoverflow.com/questions/64259658/apache-beam-dataflow-too-slow-for-only-18k-data
