Schema update while writing to Avro files


Question


Context: We have a Dataflow job that transforms PubSub messages into Avro GenericRecords and writes them to GCS as ".avro" files. The transformation between PubSub messages and GenericRecords requires a schema. This schema changes weekly, with field additions only. We want to be able to pick up the new fields without updating the Dataflow job.

What we did: Following the advice from this post, we created a Guava Cache that refreshes its content every minute; the refresh function pulls the schema from GCS. FileIO.write then queries the Guava Cache for the latest schema and uses it to transform each element into a GenericRecord. FileIO.write also outputs to an Avro sink, which is created from the same schema.
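For reference, a minimal sketch of such a cache, assuming a hypothetical readSchemaInfoFromGcs() helper that downloads the latest schema and descriptor from GCS (the cache key is a dummy constant, since there is only one schema to track):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Entries become eligible for refresh one minute after being written.
LoadingCache<String, Map<String, Object>> cache = CacheBuilder.newBuilder()
    .refreshAfterWrite(1, TimeUnit.MINUTES)
    .build(new CacheLoader<String, Map<String, Object>>() {
      @Override
      public Map<String, Object> load(String key) throws Exception {
        // Hypothetical helper: fetches the schema/descriptor from GCS and
        // returns them under SCHEMA_KEY / DESCRIPTOR_KEY.
        return readSchemaInfoFromGcs();
      }
    });

Note that refreshAfterWrite reloads lazily: the refresh is triggered by the first read after the interval elapses (the stale value is served while the reload runs), not by a background timer. The cache should also live in a static or lazily initialized field so each Dataflow worker builds its own copy rather than serializing it with the DoFn.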

The code is as follows (fn and requiresSideInputs are static imports of Beam's Contextful.fn and Requirements.requiresSideInputs):

genericRecordsAsByteArrays.apply(FileIO.<byte[]>write()
    .via(fn((input, c) -> {
          Map<String, Object> schemaInfo = cache.get("");
          Descriptors.Descriptor paymentRecordFd =
              (Descriptors.Descriptor) schemaInfo.get(DESCRIPTOR_KEY);
          Schema schema = (Schema) schemaInfo.get(SCHEMA_KEY);

          // From concrete PaymentRecord bytes to DynamicMessage
          DynamicMessage paymentRecordMsg = DynamicMessage.parseFrom(paymentRecordFd, input);

          // From DynamicMessage to Avro-encoded bytes
          try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(output, null);
            ProtobufDatumWriter<DynamicMessage> pbWriter = new ProtobufDatumWriter<>(schema);
            pbWriter.write(paymentRecordMsg, encoder);
            encoder.flush();

            // From Avro-encoded bytes to GenericRecord
            byte[] avroContents = output.toByteArray();
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroContents, null);
            return reader.read(null, decoder);
          }
        }, requiresSideInputs()),
        fn((output, c) -> {
          Map<String, Object> schemaInfo = cache.get("");
          Schema schema = (Schema) schemaInfo.get(SCHEMA_KEY);
          return AvroIO.sink(schema).withCodec(CodecFactory.snappyCodec());
        }, requiresSideInputs()))
    .withNumShards(5)
    .withNaming(new PerWindowFilenames(baseDir, ".avro"))
    .to(baseDir.toString()));

My questions:

  1. What happens when we are writing to one Avro file and a schema update suddenly lands, so that we are now writing records with the new schema into an Avro file that was created with the old schema?
  2. Does Dataflow start a new file when it sees a new schema?
  3. Does Dataflow ignore the new schema and the additional fields until a new file is created?

Each Avro file has its own schema at the very beginning of the file, so I am not sure what the expected behavior is.
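(For what it's worth, the writer schema can be inspected by reading a finished file's header back, e.g. with a snippet like this; the file name is illustrative:)

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Print the writer schema stored in the header of an Avro container file.
try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(
        new File("some-shard.avro"), new GenericDatumReader<GenericRecord>())) {
  System.out.println(fileReader.getSchema().toString(true));
}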


Answer 1:


"now we are writing the new schema into an Avro file created with the old schema"

That's not possible. Each Avro file has exactly one schema, stored in its header. If the schema changes, then by definition you'd be writing to a new file.

I doubt Dataflow ignores fields.
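One consequence worth noting: because the weekly changes are additions only, readers can still process the older files via Avro schema resolution, as long as the added fields declare defaults. A minimal sketch, assuming a hypothetical payment_record_v2.avsc containing the added fields and an old-schema file on disk:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Read a file written with the old schema using the new (reader) schema;
// the writer schema is taken from the file header, and fields added in v2
// are filled in with their declared defaults.
Schema newSchema = new Schema.Parser().parse(new File("payment_record_v2.avsc"));
try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(
        new File("old-schema-shard.avro"),
        new GenericDatumReader<GenericRecord>(null, newSchema))) {
  while (fileReader.hasNext()) {
    GenericRecord record = fileReader.next();
    // record conforms to newSchema here.
  }
}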



Source: https://stackoverflow.com/questions/59903206/schema-update-while-writing-to-avro-files
