Question
I am trying to stream Pub/Sub messages in JSON format into a Spanner database, and insert_update works very well. The Spanner table has a composite primary key, so the existing data must be deleted before the new data from Pub/Sub is inserted (so that only the latest data is present). Spanner replace or insert/update mutations do not work in this case. I added this pipeline:
import org.apache.beam.* ;
public class PubSubToSpannerPipeline {
  // JSON to TableData Object
  public static class PubSubToTableDataFn extends DoFn<String, TableData> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      .
      .
      .
    }
  }
  public interface PubSubToSpannerOptions extends PipelineOptions, StreamingOptions {
    .
    .
    .
  }
  public static void main(String[] args) {
    PubSubToSpannerOptions options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(PubSubToSpannerOptions.class);
    options.setStreaming(true);
    SpannerConfig spannerConfig =
        SpannerConfig.create()
            .withProjectId(options.getProjectId())
            .withInstanceId(options.getInstanceId())
            .withDatabaseId(options.getDatabaseId());
    Pipeline pipeLine = Pipeline.create(options);
    PCollection<TableData> tableDataMsgs = pipeLine.apply(PubsubIO.readStrings()
        .fromSubscription(options.getInputSubscription()))
        .apply("ParsePubSubMessage", ParDo.of(new PubSubToTableDataFn ()));
    // Window function
    PCollection<TableData> tableDataJson = tableDataMsgs
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));
    PCollection<MutationGroup> upsertMutationGroup = tableDataJson.apply("TableDataMutation",
        MapElements.via(new SimpleFunction<TableData, MutationGroup>() {
          public MutationGroup apply(TableData input) {
            String object_id = input.objectId;
            pipeLine.apply("ReadExistingData", SpannerIO.read()
                .withSpannerConfig(spannerConfig)
                .withQuery("SELECT object_id, mapped_object_id, mapped_object_name from TableName where object_id ='" + object_id + "'")
                .apply("MutationForExistingTableData",
                    ParDo.of(new DoFn<Struct, Mutation>(){
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        Struct str = c.element();
                        c.output(Mutation.delete("TableName", KeySet.newBuilder()
                            .addKey(Key.newBuilder()
                                .append(str.getString("object_id"))
                                .append(str.getString("mapped_object_id"))
                                .append(str.getString("mapped_object_name")).build()).build()));
                      }
                    } ))
                .apply("DeleteExistingTableData", SpannerIO.write().withSpannerConfig(spannerConfig));
            Mutation dataMutation = Mutation.newReplaceBuilder("TableName",
                .
                .
                .
            );
            List<Mutation> list = new ArrayList<Mutation>();
            List<Map<String, String>> mappingList = input.listOfObjectRows;
            for (Map<String, String> objectMap : mappingList ) {
              list.add(Mutation.newReplaceBuilder("TableName",
                  .
                  .
                  .);
            }
            return MutationGroup.create(dataMutation, list);
          }
        } )));
    upsertMutationGroup.apply("WriteDataToSpanner", SpannerIO.write()
        .withSpannerConfig(spannerConfig)
        .grouped());
    // Run the pipeline.
    pipeLine.run().waitUntilFinish();
  }
}
class TableData implements Serializable {
  String objectId;
  List<Map<String, String>> listOfObjectRows;
}
The expectation is that the existing mapping data is deleted from the table before the new data is inserted or updated.
Answer 1:
I am not entirely sure what you are doing, but it looks like you want to:
- Read some existing data with a key (or partial key) matching the Pub/Sub message
- Delete this data
- Insert new data from the Pub/Sub message
One option is to create a DoFn that performs this read/delete/insert (or a read/update) within a read-write transaction. This preserves the consistency of the database.
Use SpannerIO.WriteFn as a model: you need to declare the SpannerAccessor as transient, create it in the @Setup event handler and close it in the @Teardown event handler.
The @ProcessElement handler of your DoFn would create a read-write transaction, inside which you would read the rows for the key, update or delete them, and then insert the new element(s).
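The order of operations inside such a transaction can be sketched without a database. The following is a minimal model, not Spanner client code: an in-memory list stands in for the table, a synchronized method stands in for the read-write transaction, and the names MappingRow, InMemoryMappingTable and replaceRowsFor are hypothetical. In a real DoFn, the three numbered steps would run inside the transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row type with the three key columns named in the question.
final class MappingRow {
    final String objectId;
    final String mappedObjectId;
    final String mappedObjectName;

    MappingRow(String objectId, String mappedObjectId, String mappedObjectName) {
        this.objectId = objectId;
        this.mappedObjectId = mappedObjectId;
        this.mappedObjectName = mappedObjectName;
    }
}

// In-memory stand-in for the Spanner table (an assumption for illustration).
final class InMemoryMappingTable {
    private final List<MappingRow> rows = new ArrayList<>();

    // Mirrors the work done for one message inside one transaction.
    // "synchronized" stands in for the isolation the transaction provides.
    // Returns the number of rows that were deleted.
    synchronized int replaceRowsFor(String objectId, List<MappingRow> newRows) {
        // 1. Read the existing rows for this object_id.
        List<MappingRow> existing = rowsFor(objectId);
        // 2. Delete exactly the rows that were read.
        rows.removeAll(existing);
        // 3. Insert the rows carried by the new message.
        rows.addAll(newRows);
        return existing.size();
    }

    // Returns a copy of the rows whose object_id matches.
    synchronized List<MappingRow> rowsFor(String objectId) {
        List<MappingRow> matches = new ArrayList<>();
        for (MappingRow row : rows) {
            if (row.objectId.equals(objectId)) {
                matches.add(row);
            }
        }
        return matches;
    }
}
```

Calling replaceRowsFor with an object_id and the rows from a new message leaves only those rows for that object and does not touch the rows of other objects.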
The disadvantage of this method is that only one Pub/Sub message will be processed per Spanner transaction (unless you do something clever in a previous step such as grouping them), and this is a complex read-write transaction. If your messages/sec rate is relatively low this would be fine, but if not, this method would be putting a lot more load on your DB.
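One possible reading of "grouping them" is to collapse each window to a single message per object_id before the transactional step, since only the latest data has to survive. The sketch below shows that reduction in plain Java. TimedMessage and its publishTimeMillis field are assumptions that do not appear in the question, and in a Beam pipeline the same logic would sit behind a per-key grouping or combine step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down message; publishTimeMillis is an assumed field.
final class TimedMessage {
    final String objectId;
    final long publishTimeMillis;
    final String payload;

    TimedMessage(String objectId, long publishTimeMillis, String payload) {
        this.objectId = objectId;
        this.publishTimeMillis = publishTimeMillis;
        this.payload = payload;
    }
}

final class LatestPerObject {
    // Returns one message per object_id: the one with the greatest publish
    // time. Objects appear in the order in which they were first seen.
    static List<TimedMessage> latestPerObjectId(List<TimedMessage> windowContents) {
        Map<String, TimedMessage> newest = new LinkedHashMap<>();
        for (TimedMessage message : windowContents) {
            TimedMessage seen = newest.get(message.objectId);
            if (seen == null || message.publishTimeMillis > seen.publishTimeMillis) {
                newest.put(message.objectId, message);
            }
        }
        return new ArrayList<>(newest.values());
    }
}
```

With this reduction in front of the transactional DoFn, each object_id costs at most one transaction per window, however many messages arrived for it.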
A second option is to use blind deletes of a key-range. This can only work if the object_id is the first part of the composite key (which it appears to be from your code).
You would create a MutationGroup containing a delete mutation with a key range, which blind-deletes any existing rows whose keys start with the object_id, followed by insert mutations to replace the deleted rows.
MutationGroup.create(
    // Delete rows with key starting with object_id.
    Mutation.delete("TableName", KeySet.newBuilder()
        .addRange(
            KeyRange.closedClosed(
                Key.of(str.getString("object_id")),
                Key.of(str.getString("object_id"))))
        .build()),
    // Insert replacement rows.
    Mutation.newInsertBuilder("TableName")
        .set("column").to("value")
        ...
        .build(),
    Mutation.newInsertBuilder("TableName")
        ...);
This MutationGroup would then be passed to SpannerIO.write().grouped() as before, so that the mutations can be batched for efficiency.
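The blind delete works because a key range given with fewer components than the primary key matches every row whose key starts with those components, compared column by column rather than as a string prefix. The following plain-Java model illustrates that behaviour under assumed names (PrefixDeleteModel, startsWith, applyGroup). It does not call the Spanner API, and composite keys are represented as lists of strings.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixDeleteModel {
    // True if the composite key begins with every component of the partial
    // key. Components are compared whole, so "obj-1" does not match "obj-10".
    static boolean startsWith(List<String> fullKey, List<String> partialKey) {
        return fullKey.size() >= partialKey.size()
            && fullKey.subList(0, partialKey.size()).equals(partialKey);
    }

    // Models one mutation group: delete the closed range [prefix, prefix],
    // then insert the replacement keys. Returns the resulting table contents.
    static List<List<String>> applyGroup(
            List<List<String>> table,
            List<String> prefix,
            List<List<String>> inserts) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> key : table) {
            if (!startsWith(key, prefix)) {
                result.add(key); // outside the deleted range, so it is kept
            }
        }
        result.addAll(inserts);
        return result;
    }
}
```

Because the delete never reads the table, no query is needed for the existing rows, which is what makes this option cheaper than the transactional one.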
Source: https://stackoverflow.com/questions/58280581/pubsub-to-spanner-streaming-pipeline