PubSub to Spanner Streaming Pipeline

若如初见 · Submitted on 2021-01-29 08:31:36

Question


I am trying to stream Pub/Sub messages containing JSON into a Spanner database, and insert-or-update works very well. The Spanner table has a composite primary key, so the existing data needs to be deleted before new data from Pub/Sub is inserted (so that only the latest data is present). Spanner's replace or insert/update mutations do not work in this case, since they only affect rows whose full keys match and leave stale rows behind. I added the following pipeline:


import org.apache.beam.*;

public class PubSubToSpannerPipeline {

  // JSON to TableData Object
  public static class PubSubToTableDataFn extends DoFn<String, TableData> {

    @ProcessElement
    public void processElement(ProcessContext c) {
      .
      .
      .
    }
  }

  public interface PubSubToSpannerOptions extends PipelineOptions, StreamingOptions {
    .
    .
    .
  }

  public static void main(String[] args) {
    PubSubToSpannerOptions options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(PubSubToSpannerOptions.class);
    options.setStreaming(true);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
        .withProjectId(options.getProjectId())
        .withInstanceId(options.getInstanceId())
        .withDatabaseId(options.getDatabaseId());

    Pipeline pipeLine = Pipeline.create(options);

    PCollection<TableData> tableDataMsgs = pipeLine.apply(PubsubIO.readStrings()
        .fromSubscription(options.getInputSubscription()))
        .apply("ParsePubSubMessage", ParDo.of(new PubSubToTableDataFn ()));

    // Window function
    PCollection<TableData> tableDataJson = tableDataMsgs
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

    PCollection<MutationGroup> upsertMutationGroup = tableDataJson.apply("TableDataMutation",
        MapElements.via(new SimpleFunction<TableData, MutationGroup>() {

          public MutationGroup apply(TableData input) {

            String object_id = input.objectId;

            pipeLine.apply("ReadExistingData", SpannerIO.read()
                .withSpannerConfig(spannerConfig)
                .withQuery("SELECT object_id, mapped_object_id, mapped_object_name from TableName where object_id ='" + object_id + "'")
            .apply("MutationForExistingTableData", 
                    ParDo.of(new DoFn<Struct, Mutation>(){
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        Struct str = c.element();
                        c.output(Mutation.delete("TableName", KeySet.newBuilder()
                            .addKey(Key.newBuilder()
                                .append(str.getString("object_id"))
                                .append(str.getString("mapped_object_id"))
                                .append(str.getString("mapped_object_name")).build()).build()));
                      }
                    } ))
            .apply("DeleteExistingTableData", SpannerIO.write().withSpannerConfig(spannerConfig));

              Mutation dataMutation = Mutation.newReplaceBuilder("TableName",
                  .
                  .
                  .

                  );
              List<Mutation> list = new ArrayList<Mutation>();


              List<Map<String, String>> mappingList = input.listOfObjectRows;

              for (Map<String, String> objectMap : mappingList ) {
                list.add(Mutation.newReplaceBuilder("TableName",
                    .
                    .
                    .);
              }     

              return MutationGroup.create(dataMutation, list);


          }
        }));


        upsertMutationGroup.apply("WriteDataToSpanner", SpannerIO.write()
            .withSpannerConfig(spannerConfig)
            .grouped());

        // Run the pipeline.
        pipeLine.run().waitUntilFinish();
  }

}

class TableData implements Serializable {
  String objectId;
  List<Map<String, String>> listOfObjectRows;

}

The expectation is that the existing mapping data must be deleted from the table before the new data is inserted or updated.
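For reference, a minimal sketch of what the elided PubSubToTableDataFn body might look like, assuming the Pub/Sub payload is a JSON object that Jackson can bind directly onto TableData; the parser choice and JSON shape are assumptions, since the question elides this code:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical body for the elided PubSubToTableDataFn: bind the JSON
// payload onto TableData with Jackson (assumed payload shape).
public static class PubSubToTableDataFn extends DoFn<String, TableData> {
  // ObjectMapper is not serializable, so create it per worker rather
  // than serializing it with the DoFn.
  private transient ObjectMapper mapper;

  @Setup
  public void setup() {
    mapper = new ObjectMapper();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    c.output(mapper.readValue(c.element(), TableData.class));
  }
}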


Answer 1:


I am not entirely sure what you are doing, but it looks like you want to:

  • Read some existing data with a key (or partial key) matching the Pub/Sub message
  • Delete this data
  • Insert new data from the Pub/Sub message

One option is to create a DoFn that performs this read/delete/insert (or read/update) within a read-write transaction. This will preserve database consistency.

Use the SpannerIO.WriteFn as a model: you need to make the SpannerAccessor transient and create/close it in the @Setup and @Teardown event handlers.

The @ProcessElement handler of your DoFn would create a read-write transaction, inside which it would read the rows for the key, update or delete them, and then insert the new element(s).
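A minimal sketch of such a DoFn, assuming a Beam version where SpannerAccessor is public (Beam 2.20 onwards) and that one message maps to one transaction; the ReplaceRowsFn name and the details of the replacement rows are illustrative:

import java.util.ArrayList;
import java.util.List;
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import org.apache.beam.sdk.io.gcp.spanner.SpannerAccessor;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: one read-write transaction per element.
class ReplaceRowsFn extends DoFn<TableData, Void> {
  private final SpannerConfig spannerConfig;
  // Transient, as in SpannerIO.WriteFn: the accessor is created per worker,
  // not serialized with the DoFn.
  private transient SpannerAccessor spannerAccessor;

  ReplaceRowsFn(SpannerConfig spannerConfig) {
    this.spannerConfig = spannerConfig;
  }

  @Setup
  public void setup() {
    spannerAccessor = SpannerAccessor.getOrCreate(spannerConfig);
  }

  @Teardown
  public void teardown() {
    spannerAccessor.close();
  }

  @ProcessElement
  public void processElement(@Element TableData input) {
    DatabaseClient client = spannerAccessor.getDatabaseClient();
    client.readWriteTransaction().run(transaction -> {
      List<Mutation> mutations = new ArrayList<>();
      // Read the keys of the existing rows for this object_id...
      try (ResultSet rs = transaction.executeQuery(
          Statement.newBuilder(
                  "SELECT object_id, mapped_object_id, mapped_object_name "
                      + "FROM TableName WHERE object_id = @id")
              .bind("id").to(input.objectId)
              .build())) {
        while (rs.next()) {
          // ...and delete each of them.
          mutations.add(Mutation.delete("TableName",
              Key.of(rs.getString(0), rs.getString(1), rs.getString(2))));
        }
      }
      // Then buffer the replacement inserts (columns elided, as in the question):
      // mutations.add(Mutation.newInsertBuilder("TableName"). ... .build());
      transaction.buffer(mutations);
      return null;
    });
  }
}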

The disadvantage of this method is that only one Pub/Sub message will be processed per Spanner transaction (unless you do something clever in a previous step, such as grouping them; see the sketch below), and this is a complex read-write transaction. If your messages/sec rate is relatively low this would be fine, but if not, this method would put a lot more load on your DB.
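For the grouping hinted at above, one Beam-native option is GroupIntoBatches; a sketch, assuming batching by object_id within the question's one-minute windows (the batch size of 100 is an arbitrary choice):

// Key each message by object_id, then batch up to 100 per key so that one
// read-write transaction can process a whole batch.
// (Uses org.apache.beam.sdk.transforms.{WithKeys,GroupIntoBatches}.)
PCollection<KV<String, Iterable<TableData>>> batched = tableDataJson
    .apply("KeyByObjectId",
        WithKeys.of((TableData td) -> td.objectId)
            .withKeyType(TypeDescriptors.strings()))
    .apply("BatchPerKey", GroupIntoBatches.<String, TableData>ofSize(100));

The transactional DoFn would then take KV<String, Iterable<TableData>> as input and buffer all of the batch's mutations in a single transaction.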

A second option is to use blind deletes of a key-range. This can only work if the object_id is the first part of the composite key (which it appears to be from your code).

You would create a MutationGroup containing a Delete mutation with a key-range that blind-deletes any existing rows whose keys start with the object_id, followed by insert mutations to replace the deleted rows:

MutationGroup.create(
    // Delete rows with keys starting with object_id. The range keys are
    // partial (only the first key component is set), so the range matches
    // every row whose composite key begins with that object_id.
    Mutation.delete("TableName", KeySet.newBuilder()
        .addRange(
            KeyRange.closedClosed(
                Key.of(str.getString("object_id")),
                Key.of(str.getString("object_id"))))
        .build()),
    // Insert replacement rows.
    Mutation.newInsertBuilder("TableName")
        .set("column").to("value")
        ...
        .build(),
    Mutation.newInsertBuilder("TableName")
        ...);

This would then be passed to SpannerIO.write().grouped() as before so that they can be batched for efficiency.
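Slotting this into the question's pipeline, the MapElements step might look like the following sketch; the mapped_object_id and mapped_object_name columns come from the question's query, but the keys used to read them out of each row map are assumptions:

// Replaces the "TableDataMutation" step from the question; no per-element
// read is needed, so nothing is applied to the pipeline at runtime.
PCollection<MutationGroup> upsertMutationGroup = tableDataJson.apply(
    "TableDataMutation",
    MapElements.via(new SimpleFunction<TableData, MutationGroup>() {
      @Override
      public MutationGroup apply(TableData input) {
        // Blind-delete every existing row whose composite key starts
        // with this object_id.
        Mutation delete = Mutation.delete("TableName", KeySet.newBuilder()
            .addRange(KeyRange.closedClosed(
                Key.of(input.objectId), Key.of(input.objectId)))
            .build());

        // One replacement insert per mapped-object row (map keys assumed).
        List<Mutation> inserts = new ArrayList<>();
        for (Map<String, String> row : input.listOfObjectRows) {
          inserts.add(Mutation.newInsertBuilder("TableName")
              .set("object_id").to(input.objectId)
              .set("mapped_object_id").to(row.get("mapped_object_id"))
              .set("mapped_object_name").to(row.get("mapped_object_name"))
              .build());
        }
        return MutationGroup.create(delete, inserts);
      }
    }));

Because the delete and the inserts are in the same MutationGroup, SpannerIO submits them in the same transaction, so readers never see the table with the old rows removed but the new rows missing.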



Source: https://stackoverflow.com/questions/58280581/pubsub-to-spanner-streaming-pipeline
