avro

Optional array in Avro schema

Submitted by 南楼画角 on 2019-12-21 03:52:52
Question: I'm wondering whether or not it is possible to have an optional array. Let's assume a schema like this:

    {
      "type": "record",
      "name": "test_avro",
      "fields": [
        {"name": "test_field_1", "type": "long"},
        {"name": "subrecord", "type": [
          {"type": "record", "name": "subrecord_type", "fields": [{"name": "field_1", "type": "long"}]},
          "null"
        ]},
        {"name": "simple_array", "type": {"type": "array", "items": "string"}}
      ]
    }

Trying to write an Avro record without "simple_array" would result in an NPE in the …
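For illustration (not part of the original post), one common way to make the array optional is to union it with "null" and give it a null default; a minimal Java sketch:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class OptionalArraySketch {
    public static void main(String[] args) {
        // Union the array with "null" (and default to null) so the field may be omitted.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"test_avro\",\"fields\":[" +
            "{\"name\":\"test_field_1\",\"type\":\"long\"}," +
            "{\"name\":\"simple_array\",\"type\":[\"null\",{\"type\":\"array\",\"items\":\"string\"}],\"default\":null}" +
            "]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("test_field_1", 42L);
        // "simple_array" is left unset; with the null branch in the union this is legal.
        System.out.println(rec);
    }
}
```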

Apache Kafka and Avro: org.apache.avro.generic.GenericData$Record cannot be cast to com.harmeetsingh13.java.Customer

Submitted by 主宰稳场 on 2019-12-21 02:51:17
Question: Whenever I try to read a message from the Kafka queue, I get the following exception:

    [error] (run-main-0) java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to com.harmeetsingh13.java.Customer
    java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to com.harmeetsingh13.java.Customer
        at com.harmeetsingh13.java.consumers.avrodesrializer.AvroSpecificDeserializer.infiniteConsumer(AvroSpecificDeserializer.java:79)
        at …
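Assuming the consumer uses Confluent's KafkaAvroDeserializer (the truncated post does not show the consumer configuration), a sketch of the usual fix is to enable the specific-reader flag so the deserializer returns generated classes instead of GenericData.Record:

```java
import java.util.Properties;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SpecificAvroConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");
        // Without this flag the deserializer returns GenericData.Record,
        // which cannot be cast to the generated Customer class.
        props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

        KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(props);
        // subscribe and poll as before; values now deserialize as Customer instances
    }
}
```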

Concat Avro files using avro-tools

Submitted by 眉间皱痕 on 2019-12-20 20:39:03
Question: I'm trying to merge Avro files into one big file. The problem is that the concat command does not accept a wildcard:

    hadoop jar avro-tools.jar concat /input/part* /output/bigfile.avro

I get:

    Exception in thread "main" java.io.FileNotFoundException: File does not exist: /input/part*

I tried using "" and '' but no luck.

Answer 1: I quickly checked Avro's source code (1.7.7) and it seems that concat does not support glob patterns (basically, it calls FileSystem.open() on each argument except the last one) …
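One possible workaround (a sketch, not from the original answer) is to expand the glob yourself and hand the resulting file list to avro-tools; this assumes the avro-tools jar, which exposes org.apache.avro.tool.Main, and hadoop-common are on the classpath:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExpandGlobForConcat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Expand the glob ourselves, since concat opens each argument literally.
        List<String> cmd = new ArrayList<>();
        cmd.add("concat");
        for (FileStatus status : fs.globStatus(new Path("/input/part*"))) {
            cmd.add(status.getPath().toString());
        }
        cmd.add("/output/bigfile.avro");

        // Hand the expanded argument list to avro-tools' entry point.
        org.apache.avro.tool.Main.main(cmd.toArray(new String[0]));
    }
}
```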

Get a typed value from an Avro GenericRecord

Submitted by 佐手、 on 2019-12-20 09:56:07
Question: Given a GenericRecord, what is the recommended way to retrieve a typed value, as opposed to an Object? Are we expected to cast the values, and if so, what is the mapping from Avro types to Java types? For example, Avro array == Java Collection, and Avro string == Java Utf8. Since every GenericRecord contains its schema, I was hoping for a type-safe way to retrieve values.

Answer 1: Avro has eight primitive types and five complex types (excluding unions, which are a combination of other types). The …
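For illustration (the field names here are made up, not from the original question), typical casts when pulling values out of a GenericRecord look like this:

```java
import java.util.List;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

public class TypedAccessSketch {
    // Strings come back as Utf8, arrays as java.util.List, nested records as GenericRecord.
    static void printCustomer(GenericRecord record) {
        long id = (Long) record.get("id");                              // Avro long   -> java.lang.Long
        String name = record.get("name").toString();                    // Avro string -> Utf8; toString() is safest
        @SuppressWarnings("unchecked")
        List<Utf8> tags = (List<Utf8>) record.get("tags");              // Avro array  -> java.util.List
        GenericRecord address = (GenericRecord) record.get("address");  // Avro record -> GenericRecord

        System.out.println(id + " " + name + " " + tags + " " + address);
    }
}
```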

How can I load Avros in Spark using the schema on-board the Avro file(s)?

Submitted by 落花浮王杯 on 2019-12-20 09:42:04
Question: I am running CDH 4.4 with Spark 0.9.0 from a Cloudera parcel. I have a bunch of Avro files that were created via Pig's AvroStorage UDF. I want to load these files in Spark, using a generic record or the schema on board the Avro files. So far I've tried this:

    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable
    import org.apache.commons.lang.StringEscapeUtils.escapeCsv
    import org.apache.hadoop.fs.Path
    import org …
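A different route (not the RDD approach in the post, and assuming a much newer Spark than the 0.9.0 mentioned: the built-in avro data source shipped with Spark 2.4+, or the external spark-avro package before that) reads the embedded writer schema automatically:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadAvroSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-avro").getOrCreate();

        // The avro data source reads the writer schema embedded in the files,
        // so no schema has to be supplied by hand.
        Dataset<Row> df = spark.read().format("avro").load("hdfs:///path/to/avro/dir");
        df.printSchema();
        df.show(10, false);
    }
}
```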

Avro schema for JSON array

Submitted by 。_饼干妹妹 on 2019-12-20 05:37:28
Question: Suppose I have the following JSON:

    [
      {"id":1,"text":"some text","user_id":1},
      {"id":1,"text":"some text","user_id":2},
      ...
    ]

What would be an appropriate Avro schema for this array of objects?

Answer 1: [short answer] The appropriate Avro schema for this array of objects would look like:

    const type = avro.Type.forSchema({
      type: 'array',
      items: {
        type: 'record',
        fields: [
          { name: 'id', type: 'int' },
          { name: 'text', type: 'string' },
          { name: 'user_id', type: 'int' }
        ]
      }
    });

[long answer] We can use …
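The answer above uses the avsc JavaScript library; a rough Java equivalent (a sketch, with a record name added because Avro's Java parser requires named records) might look like:

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class JsonArraySchemaSketch {
    public static void main(String[] args) {
        // Array-of-records schema matching the JSON objects above.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"array\",\"items\":{" +
            "\"type\":\"record\",\"name\":\"Message\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"int\"}," +
            "{\"name\":\"text\",\"type\":\"string\"}," +
            "{\"name\":\"user_id\",\"type\":\"int\"}" +
            "]}}");

        GenericRecord msg = new GenericData.Record(schema.getElementType());
        msg.put("id", 1);
        msg.put("text", "some text");
        msg.put("user_id", 1);

        GenericData.Array<GenericRecord> array =
            new GenericData.Array<>(schema, Arrays.asList(msg));
        System.out.println(array);
    }
}
```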

Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

Submitted by 半城伤御伤魂 on 2019-12-20 03:43:25
Question: Context: I am able to submit a MapReduce job from the Druid overlord to EMR. My data source is in S3 in Parquet format. I have a timestamp column (INT96) in the Parquet data, which is not supported in the Avro schema. The error occurs while parsing the timestamp. The stack trace is:

    Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96 …
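One possible workaround (a sketch with made-up column and bucket names, not from the original post) is to rewrite the Parquet data so the timestamp is no longer stored as INT96, for example by re-encoding it as a string with Spark before handing it to Druid:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;

public class RewriteInt96TimestampSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("drop-int96").getOrCreate();

        Dataset<Row> df = spark.read().parquet("s3://bucket/input/");
        // Re-encode the timestamp as an ISO-8601 string so the rewritten Parquet
        // no longer contains an INT96 column for AvroSchemaConverter to reject.
        df.withColumn("event_time", date_format(col("event_time"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
          .write().parquet("s3://bucket/output/");
    }
}
```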

Data serialization framework

Submitted by 左心房为你撑大大i on 2019-12-19 19:29:36
Question: I'm new to Apache Avro (a serialization framework). I know what serialization is, but why are there separate frameworks like Avro, Thrift, and Protocol Buffers? Why can't we use the Java serialization APIs instead of these separate frameworks? Are there any flaws in the Java serialization APIs? Also, what is the meaning of the phrase "does not require running a code-generation program when a schema changes" in Avro or in any other serialization framework? Please help me understand all of this!

Answer 1: Why …
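To illustrate the "no code generation" phrase (a sketch assuming an existing Avro data file named users.avro), Avro can read data generically using only the schema embedded in the file, with no generated classes at all:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class GenericReadSketch {
    public static void main(String[] args) throws Exception {
        // No generated classes: the reader uses the writer schema stored in the file,
        // so a schema change only means new data files, not regenerated code.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("users.avro"), new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```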

Spark: Writing to Avro file

Submitted by 眉间皱痕 on 2019-12-19 05:07:35
Question: In Spark, I have an RDD from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file:

    val job = new Job(new Configuration())
    AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema))
    rdd.map(elem => (new SparkAvroKey(doTransformation(elem._1)), elem._2))
       .saveAsNewAPIHadoopFile(outputPath,
         classOf[AvroKey[GenericRecord]],
         classOf[org.apache.hadoop.io.NullWritable],
         classOf[AvroKeyOutputFormat[GenericRecord]],
         job.getConfiguration)

When running …
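For reference, the same write path rendered in Java (a sketch only; it does not address whatever error the truncated post goes on to describe):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;

public class WriteAvroSketch {
    // Writes a pair RDD of (AvroKey<GenericRecord>, NullWritable) back out as Avro,
    // registering the output schema on a Hadoop Job first.
    static void save(JavaPairRDD<AvroKey<GenericRecord>, NullWritable> rdd,
                     Schema outputSchema, String outputPath) throws Exception {
        Job job = Job.getInstance();
        AvroJob.setOutputKeySchema(job, outputSchema);
        rdd.saveAsNewAPIHadoopFile(outputPath,
                                   AvroKey.class,
                                   NullWritable.class,
                                   AvroKeyOutputFormat.class,
                                   job.getConfiguration());
    }
}
```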

How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Submitted by 你。 on 2019-12-18 19:06:11
Question: I'm looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. I'm looking for the equivalent using Parquet):

    public static Map<String, Object> read(InputStream inputStream) throws IOException {
        ObjectMapper objectMapper = new ObjectMapper();
        return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() { });
    }

    public static …
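One possible approach (a sketch, not from the original question) is to go through Avro: wrap the map in a record whose single field is an Avro map, then use AvroParquetWriter and AvroParquetReader. Note that Avro maps need a single value type, so Object is narrowed to string here:

```java
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class MapParquetSketch {
    // A record with one field holding a map of string values.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"MapHolder\",\"fields\":[" +
        "{\"name\":\"data\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

    public static void write(Map<String, String> map, String file) throws Exception {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("data", map);
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(file)).withSchema(SCHEMA).build()) {
            writer.write(record);
        }
    }

    public static GenericRecord read(String file) throws Exception {
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(file)).build()) {
            return reader.read(); // the "data" field holds the map
        }
    }
}
```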