Question
Context: We have a Dataflow job that transforms PubSub messages into Avro GenericRecords and writes them to GCS as ".avro" files. The transformation from PubSub messages to GenericRecords requires a schema. This schema changes weekly, with field additions only, and we want to be able to update the fields without updating the Dataflow job.
What we did: Following the advice from this post, we created a Guava cache that refreshes its content every minute; the refresh function pulls the schema from GCS. FileIO.write then queries the cache for the latest schema and uses it to transform each element into a GenericRecord. FileIO.write also outputs to an Avro sink, which is created using the same schema.
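For reference, the cache is set up roughly like this. This is a simplified sketch: readSchemaFromGcs() stands in for our GCS download logic, and the loading of the protobuf descriptor stored under DESCRIPTOR_KEY is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.apache.avro.Schema;

// Refreshes at most once per minute; readers keep seeing the old value
// until a reload completes, so lookups never block on GCS.
LoadingCache<String, Map<String, Object>> cache = CacheBuilder.newBuilder()
    .refreshAfterWrite(1, TimeUnit.MINUTES)
    .build(new CacheLoader<String, Map<String, Object>>() {
        @Override
        public Map<String, Object> load(String key) throws Exception {
            // readSchemaFromGcs() is a placeholder for the GCS download.
            Schema schema = new Schema.Parser().parse(readSchemaFromGcs());
            Map<String, Object> schemaInfo = new HashMap<>();
            schemaInfo.put(SCHEMA_KEY, schema);
            // DESCRIPTOR_KEY is populated the same way from the proto
            // descriptor; omitted here for brevity.
            return schemaInfo;
        }
    });
```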
Code is as follows:
```java
genericRecordsAsByteArrays.apply(FileIO.<byte[]>write()
    // fn(...) and requiresSideInputs() are static imports from
    // Contextful and Requirements, respectively.
    .via(fn((input, c) -> {
        Map<String, Object> schemaInfo = cache.get("");
        Descriptors.Descriptor paymentRecordFd =
            (Descriptors.Descriptor) schemaInfo.get(DESCRIPTOR_KEY);
        Schema schema = (Schema) schemaInfo.get(SCHEMA_KEY);
        // From concrete PaymentRecord bytes to DynamicMessage
        DynamicMessage paymentRecordMsg = DynamicMessage.parseFrom(paymentRecordFd, input);
        try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            // From DynamicMessage to Avro-encoded bytes
            BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(output, null);
            ProtobufDatumWriter<DynamicMessage> pbWriter = new ProtobufDatumWriter<>(schema);
            pbWriter.write(paymentRecordMsg, encoder);
            encoder.flush();
            // From Avro-encoded bytes to GenericRecord
            byte[] avroContents = output.toByteArray();
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroContents, null);
            return reader.read(null, decoder);
        }
    }, requiresSideInputs()),
    fn((output, c) -> {
        // The sink is created with whatever schema the cache holds
        // at the time a new file is opened.
        Map<String, Object> schemaInfo = cache.get("");
        Schema schema = (Schema) schemaInfo.get(SCHEMA_KEY);
        return AvroIO.sink(schema).withCodec(CodecFactory.snappyCodec());
    }, requiresSideInputs()))
    .withNumShards(5)
    .withNaming(new PerWindowFilenames(baseDir, ".avro"))
    .to(baseDir.toString()));
```
My questions:
- What will happen when we are writing to one Avro file and the schema update suddenly happens, so that we are now writing records with the new schema into an Avro file created with the old schema?
- Does Dataflow start a new file when it sees a new schema?
- Does Dataflow ignore the new schema and the additional fields until a new file is created?
Each Avro file has its own schema at the very beginning of the file, so I am not sure what the expected behavior is.
Answer 1:
> now we are writing the new schema into an Avro file created with the old schema
It's not possible. Each Avro file has exactly one schema; if the schema changes, by definition you'd be writing to a new file.
I doubt Dataflow ignores the new fields.
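The writer schema is stored once in the header of each Avro container file, and every block in that file is decoded against it. You can verify which schema a given output file was written with; a minimal sketch (the file name is just an example):

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Open one of the pipeline's output files and print the writer schema
// embedded in its header.
try (DataFileReader<GenericRecord> fileReader =
        new DataFileReader<>(new File("output-00000-of-00005.avro"),
                             new GenericDatumReader<>())) {
    Schema writerSchema = fileReader.getSchema();
    System.out.println(writerSchema.toString(true));
}
```

In this pipeline, that means a file that is already open keeps the schema its sink was created with; the refreshed schema should only take effect in files whose sink is created afterwards.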
Source: https://stackoverflow.com/questions/59903206/schema-update-while-writing-to-avro-files