Dataflow output parameterized type to avro file

隐瞒了意图╮ 2021-01-28 12:05

I have a pipeline that successfully outputs an Avro file as follows:

@DefaultCoder(AvroCoder.class)
class MyOutput_T_S {
  T foo;
  S bar;
  Boolean baz;
  publi         


        
1 answer
  • 2021-01-28 13:01

    I think there are two questions (correct me if I am wrong):

    1. How do I enable the coder registry to provide coders for various parameterizations of MyOutput<T, S>?
    2. How do I write values of MyOutput<T, S> to a file using AvroIO.Write?

    The first question can be solved by registering a CoderFactory, as in the linked question you found.

    Your naive coder is probably allowing you to run the pipeline without issues because serialization is being optimized away. Certainly an Avro schema with no fields will result in those fields being dropped in a serialization+deserialization round trip.

    But assuming you fill in the schema with the fields, your approach to CoderFactory#create looks right. I don't know the exact cause of the message java.lang.IllegalArgumentException: Unable to get field id from class null, but the call to AvroCoder.of(MyOutput.class, schema) should work for an appropriately assembled schema. If there is an issue with this, more details (such as the rest of the stack trace) would be helpful.

    However, your override of CoderFactory#getInstanceComponents should return a list of values, one per type parameter of MyOutput. Like so:

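    @Override
    public List<Object> getInstanceComponents(Object value) {
      // The cast is unchecked because T and S are erased at runtime.
      @SuppressWarnings("unchecked")
      MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
      // One component per type parameter, in declaration order: T (foo), then S (bar).
      return ImmutableList.of(myOutput.foo, myOutput.bar);
    }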
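
    To try the component extraction outside a pipeline, here is a minimal plain-JDK sketch. The MyOutput class is a stand-in reconstructed from the field names in the question, and InstanceComponents is a hypothetical helper, not an SDK type; the CoderFactory interface itself is not reproduced. It swaps Guava's ImmutableList.of for Arrays.asList, because ImmutableList.of rejects null elements and foo or bar might be null.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the question's parameterized class (assumed shape).
final class MyOutput<T, S> {
  T foo;
  S bar;
  Boolean baz;

  MyOutput(T foo, S bar, Boolean baz) {
    this.foo = foo;
    this.bar = bar;
    this.baz = baz;
  }
}

// Hypothetical helper that mirrors the body of getInstanceComponents.
final class InstanceComponents {
  private InstanceComponents() {}

  static List<Object> of(Object value) {
    // Unchecked cast: the type arguments are erased at runtime.
    @SuppressWarnings("unchecked")
    MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
    // One element per type parameter, T first and S second.
    // Arrays.asList tolerates null elements, unlike ImmutableList.of.
    return Arrays.<Object>asList(myOutput.foo, myOutput.bar);
  }
}
```

    The non-parameterized field baz is deliberately left out, since it does not correspond to a type parameter.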
    

    The second question can be answered using some of the same support code as the first, but otherwise is independent. AvroIO.Write.withSchema always explicitly uses the provided schema. It does use AvroCoder under the hood, but this is actually an implementation detail. Providing a compatible schema is all that is necessary - such a schema will have to be composed for each value of T and S for which you want to output MyOutput<T, S>.
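
    As a rough illustration of composing a schema per parameterization, the sketch below builds the JSON text of an Avro record schema for given Avro type names of T and S. MyOutputSchemas and recordSchemaJson are hypothetical names, not part of Avro or the Dataflow SDK. It assumes T and S map to Avro primitive types, that baz is never null, and it omits the namespace, which a real schema would probably need to match the package of MyOutput.

```java
// Hypothetical helper; uses only the JDK and plain string concatenation.
final class MyOutputSchemas {
  private MyOutputSchemas() {}

  /**
   * Returns the JSON text of an Avro record schema for MyOutput<T, S>.
   * fooType and barType are Avro primitive type names, e.g. "string" or "long".
   */
  static String recordSchemaJson(String fooType, String barType) {
    return "{\"type\":\"record\",\"name\":\"MyOutput\",\"fields\":["
        + field("foo", fooType) + ","
        + field("bar", barType) + ","
        // Assumption: baz is never null, so a plain boolean is enough.
        + field("baz", "boolean") + "]}";
  }

  // Renders one field entry of the record's "fields" array.
  private static String field(String name, String type) {
    return "{\"name\":\"" + name + "\",\"type\":\"" + type + "\"}";
  }

  public static void main(String[] args) {
    // Schema text for MyOutput<String, Long>.
    System.out.println(recordSchemaJson("string", "long"));
  }
}
```

    The resulting text could then be parsed with Avro's Schema.Parser and the parsed schema passed to AvroIO.Write.withSchema, or to AvroCoder.of(MyOutput.class, schema) as described above. Whether reflection-based decoding accepts it depends on the schema's full name matching the class, which this sketch does not verify.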
