Question
I am trying to read Avro-format data saved in HDFS with Hadoop. But most of the examples I have seen require you to pass a schema to the job, and I don't understand that requirement. I use Pig with Avro and I have never passed schema information.
So I think I might be missing something. Basically, what's a good way to read Avro files in Hadoop MapReduce if I don't have schema information? Thanks
Answer 1:
You're right, Avro is pretty strict about knowing the type in advance. The only option I know of, if you have no idea what the schema is, is to read it as a GenericRecord. Here's a snippet of how to do that:
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Output types (Text, NullWritable here) are placeholders -- use whatever your job emits.
public class MyMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, NullWritable> {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
        GenericRecord datum = key.datum();
        Schema schema = datum.getSchema();          // the writer's schema, read from the file itself
        Object field1 = datum.get(0);               // fetch a field by position...
        Object someField = datum.get("someField");  // ...or by name
        // ...
    }
}
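For completeness, here's a rough driver sketch showing how such a mapper might be wired up (class names and paths are made up). As I understand it, if you never call AvroJob.setInputKeySchema(), the record reader just falls back to the writer's schema embedded in each .avro file, which is why no schema has to be passed in:

import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-avro-generic");
        job.setJarByClass(MyDriver.class);

        // Note: no AvroJob.setInputKeySchema(...) call -- the reader should use the
        // writer's schema stored in the Avro files themselves.
        job.setInputFormatClass(AvroKeyInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0);                   // map-only, just for this example
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}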
You won't have the nice getters and setters, of course, since Java doesn't know what type it is. The only getters available retrieve fields by either position or name. You'll have to cast the result to the type that you know the field to be. If you don't know, you'll need instanceof checks for every possibility, since Java is statically compiled (this is also why having access to the schema isn't as helpful as you might at first think).
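A sketch of what those checks might look like (the field name "someField" is just a stand-in); one thing to keep in mind is that generic string fields usually come back as org.apache.avro.util.Utf8 rather than java.lang.String:

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

public class FieldInspector {
    // "someField" is a hypothetical field name -- substitute whatever your records contain.
    static void inspect(GenericRecord datum) {
        Object someField = datum.get("someField");
        if (someField instanceof Utf8 || someField instanceof String) {
            String s = someField.toString();                // Utf8 and String both stringify safely
            System.out.println("string: " + s);
        } else if (someField instanceof Integer) {
            int i = (Integer) someField;
            System.out.println("int: " + i);
        } else if (someField instanceof GenericRecord) {
            GenericRecord nested = (GenericRecord) someField; // nested record -- recurse as needed
            System.out.println("record: " + nested.getSchema().getName());
        }
    }
}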
But if you know the type it could be (or should be), you can call getSchema() on the class generated from the avsc that you expect your input to be, create a new instance of it, then copy the fields one by one from the GenericRecord onto that new object. This gives you back access to the normal Avro methods. It gets more complicated, of course, when dealing with unions, nulls, and schema versioning.
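A rough sketch of that conversion, assuming a generated class named MyRecord (the class and field names are hypothetical, and it only works cleanly if the field positions in the two schemas line up):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class RecordConverter {
    // MyRecord stands for the class generated from your .avsc file.
    public static MyRecord toSpecific(GenericRecord datum) {
        MyRecord typed = new MyRecord();
        Schema expected = typed.getSchema();          // schema the generated class was built from

        for (Schema.Field f : expected.getFields()) {
            Object v = datum.get(f.name());           // pull from the GenericRecord by name
            if (v != null) {
                typed.put(f.pos(), v);                // assumes positions match between schemas
            }
        }
        return typed;                                 // the generated getters/setters work again here
    }
}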
Source: https://stackoverflow.com/questions/29979282/reading-avro-format-data-in-hadoop-map-reduce