Can Datastore input in a Google Dataflow pipeline be processed in batches of N entries at a time?

Asked by 忘了有多久 on 2020-12-06 19:20

I am trying to execute a Dataflow pipeline job which would execute one function on N entries at a time from Datastore. In my case this function sends a batch of entries to a REST service as the payload.

1 Answer
  • Answered 2020-12-06 20:16

    You can batch these elements up within your DoFn. For example:

    final int BATCH_SIZE = 100;

    pipeline
      // 1. Read input from Datastore
      .apply(DatastoreIO.readFrom(datasetId, query))

      // 2. Programmatically batch entities
      .apply(ParDo.of(new DoFn<DatastoreV1.Entity, Iterable<EntryPOJO>>() {

        private List<EntryPOJO> accumulator = new ArrayList<>(BATCH_SIZE);

        @Override
        public void processElement(ProcessContext c) throws Exception {
          EntryPOJO entry = processEntity(c);
          accumulator.add(entry);
          if (accumulator.size() >= BATCH_SIZE) {
            c.output(accumulator);
            accumulator = new ArrayList<>(BATCH_SIZE);
          }
        }

        @Override
        public void finishBundle(Context c) throws Exception {
          // Emit any partial batch left over at the end of the bundle
          if (!accumulator.isEmpty()) {
            c.output(accumulator);
            accumulator = new ArrayList<>(BATCH_SIZE);
          }
        }
      }))

      // 3. Consume those batches
      .apply(ParDo.of(new DoFn<Iterable<EntryPOJO>, Object>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
          sendToRESTEndpoint(c.element());
        }
      }));

    You could also combine steps 2 and 3 in a single DoFn if you didn't want the separate "batching" step.
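    A minimal sketch of that combined approach, reusing the processEntity and sendToRESTEndpoint helpers assumed in the example above, might look like this:

    .apply(ParDo.of(new DoFn<DatastoreV1.Entity, Void>() {

      private List<EntryPOJO> accumulator = new ArrayList<>(BATCH_SIZE);

      @Override
      public void processElement(ProcessContext c) throws Exception {
        accumulator.add(processEntity(c));
        if (accumulator.size() >= BATCH_SIZE) {
          // Send a full batch as soon as it fills up
          sendToRESTEndpoint(accumulator);
          accumulator = new ArrayList<>(BATCH_SIZE);
        }
      }

      @Override
      public void finishBundle(Context c) throws Exception {
        // Send whatever is left over when the bundle ends
        if (!accumulator.isEmpty()) {
          sendToRESTEndpoint(accumulator);
          accumulator = new ArrayList<>(BATCH_SIZE);
        }
      }
    }));

    Note that since the accumulator lives on the DoFn instance, batches never cross bundle boundaries, so the final batch of each bundle may contain fewer than BATCH_SIZE entries.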
