I am trying to execute a Dataflow pipeline job that runs one function on N entries at a time from Datastore. In my case this function sends a batch of entries to a REST service as parameters.
You can batch these elements up within your DoFn. For example:
final int BATCH_SIZE = 100;

pipeline
    // 1. Read input from Datastore
    .apply(DatastoreIO.readFrom(datasetId, query))
    // 2. Programmatically batch the entries
    .apply(ParDo.of(new DoFn<DatastoreV1.Entity, Iterable<EntryPOJO>>() {
      // Per-bundle buffer; reset in startBundle because DoFn instances
      // may be reused across bundles.
      private List<EntryPOJO> accumulator;

      @Override
      public void startBundle(Context c) throws Exception {
        accumulator = new ArrayList<>(BATCH_SIZE);
      }

      @Override
      public void processElement(ProcessContext c) throws Exception {
        // processEntity is your own Entity -> EntryPOJO conversion
        EntryPOJO entry = processEntity(c.element());
        accumulator.add(entry);
        if (accumulator.size() >= BATCH_SIZE) {
          c.output(accumulator);
          accumulator = new ArrayList<>(BATCH_SIZE);
        }
      }

      @Override
      public void finishBundle(Context c) throws Exception {
        // Flush any partially filled batch at the end of the bundle
        if (!accumulator.isEmpty()) {
          c.output(accumulator);
        }
      }
    }))
    // 3. Consume those batches
    .apply(ParDo.of(new DoFn<Iterable<EntryPOJO>, Void>() {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        sendToRESTEndpoint(c.element());
      }
    }));
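Note that the runner decides where bundles begin and end, so the batch flushed from finishBundle may contain fewer than BATCH_SIZE entries; whatever sendToRESTEndpoint does should tolerate variable batch sizes.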
You could also combine steps 2 and 3 into a single DoFn if you didn't want the separate "batching" step, along the lines of the sketch below.
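Here's a minimal sketch of that combined version, assuming the same (hypothetical) processEntity and sendToRESTEndpoint helpers as above:

// Single DoFn that both batches entities and calls the REST endpoint,
// so no intermediate PCollection of batches is produced.
.apply(ParDo.of(new DoFn<DatastoreV1.Entity, Void>() {
  private List<EntryPOJO> accumulator;

  @Override
  public void startBundle(Context c) throws Exception {
    accumulator = new ArrayList<>(BATCH_SIZE);
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    accumulator.add(processEntity(c.element()));
    if (accumulator.size() >= BATCH_SIZE) {
      // Call the endpoint directly instead of emitting the batch
      sendToRESTEndpoint(accumulator);
      accumulator = new ArrayList<>(BATCH_SIZE);
    }
  }

  @Override
  public void finishBundle(Context c) throws Exception {
    // Send whatever is left over when the bundle ends
    if (!accumulator.isEmpty()) {
      sendToRESTEndpoint(accumulator);
    }
  }
}));

The trade-off is that you lose the intermediate collection of batches, which you might otherwise want to reuse in another branch of the pipeline.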