Reading avro format data in hadoop/map reduce

Posted by 风格不统一 on 2019-12-24 21:44:14

Question


I am trying to read Avro-format data stored in HDFS from a Hadoop job. Most of the examples I have seen require passing a schema to the job, but I don't understand that requirement. I use Pig with Avro and have never had to pass schema information.

So I think I might be missing something. Basically, what's a good way to read Avro files in Hadoop MapReduce if I don't have the schema information? Thanks.


Answer 1:


You're right, Avro is pretty strict about knowing the type in advance. The only option I know of, if you have no idea what the schema is, is to read each record as a GenericRecord. Here's a snippet showing how to do that:

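import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, ... > {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
        GenericRecord datum = key.datum();               // the record itself arrives as the key
        Schema schema = datum.getSchema();               // schema the record was read with
        Object field1 = datum.get(0);                    // look up a field by position
        Object someField = datum.get("someField");       // look up a field by name
        ...
    }
}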

You won't have the nice getters and setters, of course, since Java doesn't know what type the record is. The only getters available retrieve fields by either position or name, and both return Object. You'll have to cast the result to the type that you know the field to be. If you don't know the type, you'll need instanceof checks for every possibility, since Java is statically typed (this is also why having access to the schema is not as helpful as you might at first think).
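The instanceof checks described above might look roughly like the following. This is only a minimal sketch: FieldDescriber and describe are hypothetical names, and the method takes a plain Object so that it runs without Avro on the classpath. In a real mapper the argument would be the result of datum.get(...). It checks CharSequence rather than String because Avro's generic reader commonly hands back strings as Utf8 objects.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Avro): classifies a value of unknown type,
// such as the Object returned by GenericRecord.get(...).
final class FieldDescriber {
    private FieldDescriber() {}

    static String describe(Object value) {
        if (value == null) return "null";
        // Avro strings often arrive as Utf8, which is a CharSequence but not a String.
        if (value instanceof CharSequence) return "string:" + value;
        if (value instanceof Integer) return "int:" + value;
        if (value instanceof Long) return "long:" + value;
        if (value instanceof Float) return "float:" + value;
        if (value instanceof Double) return "double:" + value;
        if (value instanceof Boolean) return "boolean:" + value;
        if (value instanceof ByteBuffer) return "bytes[" + ((ByteBuffer) value).remaining() + "]";
        if (value instanceof List) return "array[" + ((List<?>) value).size() + "]";
        if (value instanceof Map) return "map[" + ((Map<?, ?>) value).size() + "]";
        // Nested records, enums and fixed values would need their own branches.
        return "other:" + value.getClass().getSimpleName();
    }
}
```

Inside the mapper this would be called as FieldDescriber.describe(datum.get("someField")). Nested records end up in the last branch here, so a fuller version would probably recurse into them.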

But if you know the type it could be (or should be), you can call getSchema() on the class generated from the avsc file that you expect your input to match, create a new instance of that class, then map the fields one by one from the GenericRecord onto the new object. This gives you back access to the normal Avro methods. It gets more complicated, of course, when dealing with unions, nulls, and schema versioning.
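As a rough illustration of the field-by-field mapping, here is a sketch under some loud assumptions: TypedUser is a hypothetical hand-written class standing in for one generated from an avsc file, the field names name and age are invented, and a Map<String, Object> stands in for the GenericRecord so the example runs without Avro. With real Avro, record.get("name") would be called on the GenericRecord instead.

```java
import java.util.Map;

// Hypothetical typed target, standing in for a class generated from an .avsc file.
final class TypedUser {
    String name;   // assumed required string field
    Integer age;   // assumed nullable field, e.g. a ["null", "int"] union

    // Copies fields one by one from an untyped record onto a new typed object.
    // A Map is used here as a stand-in for GenericRecord.
    static TypedUser fromUntyped(Map<String, Object> record) {
        TypedUser user = new TypedUser();

        Object name = record.get("name");
        if (!(name instanceof CharSequence)) {
            throw new IllegalArgumentException("name must be a string, got: " + name);
        }
        user.name = name.toString(); // also converts a Utf8-style CharSequence to String

        Object age = record.get("age");
        if (age != null && !(age instanceof Integer)) {
            throw new IllegalArgumentException("age must be an int or null, got: " + age);
        }
        user.age = (Integer) age;    // null is allowed for the nullable field

        return user;
    }
}
```

The explicit checks are there so that a record that does not match the expected shape fails with a clear message rather than a ClassCastException later on. Unions with more than one non-null branch, and fields renamed between schema versions, would need extra handling that this sketch leaves out.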



Source: https://stackoverflow.com/questions/29979282/reading-avro-format-data-in-hadoop-map-reduce
