avro

Reading Avro format data in Hadoop/MapReduce

Submitted by 风格不统一 on 2019-12-24 21:44:14
Question: I am trying to read Avro format data stored in HDFS in Hadoop. Most of the examples I have seen require us to pass a schema to the job, but I am not able to understand that requirement. I use Pig and Avro and I have never passed schema information, so I think I might be missing something. Basically, what's a good way to read Avro files in Hadoop MapReduce if I don't have schema information? Thanks. Answer 1: You're right, Avro is pretty strict about knowing the type in advance. The only …
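
As background for why the schema often does not need to be supplied explicitly: an Avro container file carries its writer schema in the file header, so a generic reader can recover it at runtime. A minimal standalone sketch (outside MapReduce; the file path comes from the command line and is an assumption):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadAvroWithoutSchema {
        public static void main(String[] args) throws Exception {
            // No reader schema supplied: GenericDatumReader falls back to the
            // writer schema embedded in the container file.
            GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
            try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File(args[0]), datumReader)) {
                Schema embedded = fileReader.getSchema(); // recovered from the file header
                System.out.println("Embedded schema: " + embedded.toString(true));
                for (GenericRecord record : fileReader) {
                    System.out.println(record);
                }
            }
        }
    }

The same embedded-schema mechanism is presumably what lets Pig load Avro files without being handed a schema.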

Avro generated class issue with JSON conversion [Kotlin]

Submitted by 半世苍凉 on 2019-12-24 21:28:37
Question: I'm having a strange issue with marshalling/unmarshalling an Avro generated class. The error I'm getting is a "not an enum" error, except there aren't any enums in my class. The error is specifically this: com.fasterxml.jackson.databind.JsonMappingException: Not an enum: {"type":"record","name":"TimeUpdateTopic","namespace":"org.company.mmd.time","fields":[{"name":"time","type":"double"}]} (through reference chain: org.company.mmd.time.TimeUpdateTopic["schema"]->org.apache.avro …
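
The answer is not included in this excerpt, but the reference chain (TimeUpdateTopic["schema"]) suggests Jackson is trying to serialize the Avro bookkeeping properties of the generated class rather than just its data fields. One commonly used workaround is a Jackson mix-in that hides those properties; a sketch (in Java rather than Kotlin, and assuming the builder generated from the schema in the error message):

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.company.mmd.time.TimeUpdateTopic; // generated from the schema in the error message

    public class AvroJacksonWorkaround {

        // Hide the Avro-internal properties that Jackson otherwise tries to serialize.
        @JsonIgnoreProperties({"schema", "specificData"})
        abstract static class AvroMixIn {}

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            mapper.addMixIn(TimeUpdateTopic.class, AvroMixIn.class);

            TimeUpdateTopic record = TimeUpdateTopic.newBuilder().setTime(12.5).build();
            System.out.println(mapper.writeValueAsString(record)); // expected: {"time":12.5}
        }
    }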

How to use Kafka schema management and Avro for breaking changes

Submitted by 半腔热情 on 2019-12-24 21:21:04
Question: Kafka schema management with Avro gives us the flexibility of backward compatibility, but how do we handle breaking changes in the schema? Assume Producer A publishes messages M to Consumer C. Assume message M has a breaking change in its schema (e.g. the name field is now split into first_name and last_name) and we have a new schema M-New. Now we are deploying Producer A-New and Consumer C-New. The problem is that until our deployment process finishes, we can have Producer A-New publish message M-New where …
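
One way to make the "is this actually a breaking change?" question concrete is Avro's own compatibility checker, which asks whether a reader using one schema can decode data written with another. A sketch based on the example in the question (the record name and exact field types are assumptions):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;
    import org.apache.avro.SchemaCompatibility.SchemaPairCompatibility;

    public class CompatibilityCheck {
        public static void main(String[] args) {
            // Old schema: a single "name" field.
            Schema oldSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"M\",\"fields\":[" +
                "{\"name\":\"name\",\"type\":\"string\"}]}");

            // New schema: "name" split into first_name / last_name, no defaults.
            Schema newSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"M\",\"fields\":[" +
                "{\"name\":\"first_name\",\"type\":\"string\"}," +
                "{\"name\":\"last_name\",\"type\":\"string\"}]}");

            // Can a consumer using the new schema read messages written with the old one?
            SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);
            System.out.println(result.getType()); // INCOMPATIBLE: new required fields have no defaults
        }
    }

Adding first_name/last_name with default values (rather than as required fields) is the usual way to keep the result compatible so that old and new producers and consumers can coexist during the rollout.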

Cast numeric fields with Kafka Connect and table.whitelist

Submitted by 元气小坏坏 on 2019-12-24 20:47:54
Question: I have created a source and a sink connector for Kafka Connect (Confluent 5.0) to push two SQL Server tables to my data lake. Here is my SQL Server table schema:

    CREATE TABLE MYBASE.dbo.TABLE1 (
        id_field int IDENTITY(1,1) NOT NULL,
        my_numericfield numeric(24,6) NULL,
        time_field smalldatetime NULL,
        CONSTRAINT PK_CBMARQ_F_COMPTEGA PRIMARY KEY (id_field)
    )
    GO

My Cassandra schema:

    create table TEST-TABLE1(my_numericfield decimal, id_field int, time_field timestamp, PRIMARY KEY (id_field));

Here is …
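
The question is cut off before the connector configuration, but the usual lever for numeric(24,6) columns in the Confluent JDBC source connector is the numeric.mapping option (newer connector versions; earlier releases exposed only numeric.precision.mapping). A hedged configuration sketch (connector name, connection details, and the float64 target type are assumptions):

    name=sqlserver-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:sqlserver://HOST:1433;databaseName=MYBASE
    table.whitelist=TABLE1
    mode=incrementing
    incrementing.column.name=id_field
    # Map NUMERIC/DECIMAL to the narrowest fitting primitive instead of
    # Connect's byte-encoded Decimal logical type.
    numeric.mapping=best_fit
    # The built-in Cast SMT is another option for plain numeric columns,
    # though its support for the Decimal logical type varies by Kafka version.
    transforms=castNumeric
    transforms.castNumeric.type=org.apache.kafka.connect.transforms.Cast$Value
    transforms.castNumeric.spec=my_numericfield:float64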

BQ Load error: Avro parsing error in position 893786302. Size of data block 27406834 is larger than the maximum allowed value 16777216

Submitted by 时间秒杀一切 on 2019-12-24 18:31:10
Question: To the BigQuery experts, I am working on a process which requires us to represent customers' shopping history in a way where we concatenate the last 12 months of transactions into a single column for Solr faceting using prefixes. While trying to load this data into BigQuery, we are getting the data block size error above. Is there any way to get around this? The actual tuple size is around 64 MB, whereas the Avro limit is 16 MB.

    [ ~]$ bq load --source_format=AVRO --allow_quoted_newlines --max_bad_records …
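
For context on the two numbers in the question: the data block size is determined by the Avro writer, not by BigQuery, so keeping blocks small is a matter of flushing more often when the file is written. A sketch with plain Avro (schema and field names are assumptions). Note the caveat: this only helps when individual rows are small; a single 64 MB row still produces a block of at least 64 MB, so a row that large has to be split or trimmed before loading.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class SmallBlockAvroWriter {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"History\",\"fields\":[" +
                "{\"name\":\"customer_id\",\"type\":\"string\"}," +
                "{\"name\":\"transactions\",\"type\":\"string\"}]}");

            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
            // Flush a block roughly every 1 MB of serialized data so that no block
            // approaches BigQuery's 16 MB data-block ceiling.
            writer.setSyncInterval(1 << 20);
            writer.create(schema, new File("history.avro"));

            GenericRecord record = new GenericData.Record(schema);
            record.put("customer_id", "c-001");
            record.put("transactions", "...concatenated 12 months of transactions...");
            writer.append(record);
            writer.close();
        }
    }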

Creating a list of dictionaries from a big CSV

Submitted by 此生再无相见时 on 2019-12-24 17:19:34
Question: I have a very big CSV file (10 GB) and I'd like to read it and create a list of dictionaries where each dictionary represents a line in the CSV. Something like [{'value1': '20150302', 'value2': '20150225','value3': '5', 'IS_SHOP': '1', 'value4': '0', 'value5': 'GA321D01H-K12'}, {'value1': '20150302', 'value2': '20150225', 'value3': '1', 'value4': '0', 'value5': '1', 'value6': 'GA321D01H-K12'}]. I'm trying to achieve it using a generator in order to avoid any memory issues; my current code is …
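
The question is about Python, but the memory argument is language-independent: yield one row-dictionary at a time instead of building the whole list. As an illustration only, here is the same streaming idea sketched in Java (header names are guessed from the sample rows above, and the naive comma split assumes no quoted commas inside fields):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Stream;

    public class LazyCsvRows {
        // Returns a lazy stream of row maps; nothing is materialized up front.
        static Stream<Map<String, String>> rows(String path, String[] header) throws IOException {
            return Files.lines(Paths.get(path))
                        .skip(1) // skip the header line
                        .map(line -> {
                            String[] cells = line.split(",", -1);
                            Map<String, String> row = new LinkedHashMap<>();
                            for (int i = 0; i < header.length && i < cells.length; i++) {
                                row.put(header[i], cells[i]);
                            }
                            return row;
                        });
        }

        public static void main(String[] args) throws IOException {
            String[] header = {"value1", "value2", "value3", "IS_SHOP", "value4", "value5"};
            try (Stream<Map<String, String>> stream = rows(args[0], header)) {
                stream.limit(2).forEach(System.out::println);
            }
        }
    }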

Is it possible to define a schema for Google Pub/Sub topics like in Kafka with AVRO?

Submitted by 戏子无情 on 2019-12-24 15:54:42
Question: As far as I know, we can define Avro schemas on Kafka, and a topic defined with such a schema will only accept data matching that schema. It's really useful to validate the data structure before it is accepted into the queue. Is there anything similar in Google Pub/Sub? Answer 1: Kafka itself does not validate a schema, and topics therefore do not inherently have schemas; they carry only pairs of byte arrays plus some metadata. It's the serializer that's part of the producing client that performs the …
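
To make the answer's point concrete: with Confluent's serializer, the schema check and registration happen in the client before the bytes ever reach the broker, which itself stays schema-unaware. A minimal producer sketch (broker address, topic name, and schema are assumptions):

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AvroProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // The serializer validates/registers the schema against the registry on the client side.
            props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081");

            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
                "{\"name\":\"id\",\"type\":\"string\"}]}");
            GenericRecord value = new GenericData.Record(schema);
            value.put("id", "42");

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "42", value));
            }
        }
    }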

Avro JSON additional field

Submitted by 天涯浪子 on 2019-12-24 15:19:42
Question: I have the following Avro schema:

    {
      "type": "record",
      "name": "test",
      "namespace": "test.name",
      "fields": [
        {"name": "items", "type":
          {"type": "array", "items":
            {"type": "record", "name": "items",
             "fields": [
               {"name": "name", "type": "string"},
               {"name": "state", "type": "string"}
             ]
            }
          }
        },
        {"name": "firstname", "type": "string"}
      ]
    }

When I use a JSON decoder and an Avro encoder to encode JSON data:

    val writer = new GenericDatumWriter[GenericRecord](schema)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val …
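
The snippet is cut off, but for reference, a complete JSON-decode / binary-encode round trip over the same schema looks roughly like the sketch below (shown in Java rather than Scala; the sample JSON document is an assumption, and Avro's JsonDecoder expects the input to match the schema closely):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.io.JsonDecoder;

    public class JsonToAvro {
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"test.name\",\"fields\":["
            + "{\"name\":\"items\",\"type\":{\"type\":\"array\",\"items\":"
            + "{\"type\":\"record\",\"name\":\"items\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"state\",\"type\":\"string\"}]}}},"
            + "{\"name\":\"firstname\",\"type\":\"string\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON); // schema from the question
            String json = "{\"items\":[{\"name\":\"dress\",\"state\":\"CA\"}],\"firstname\":\"Jane\"}";

            // JSON -> GenericRecord
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
            GenericRecord record = reader.read(null, decoder);

            // GenericRecord -> Avro binary
            GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(record, encoder);
            encoder.flush();
            System.out.println("Encoded " + out.size() + " bytes");
        }
    }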

How to use camel-avro-consumer & producer?

Submitted by 十年热恋 on 2019-12-24 14:14:01
Question: I don't see an example of how to use the camel-avro component to produce and consume Kafka Avro messages. Currently my Camel route is this; what should be changed in order to make it work with the schema registry and other props like the following, using the camel-kafka-avro consumer & producer?

    props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS …
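
The Camel-specific answer is not included in this excerpt. For reference, here is what the properties the question lists look like in a complete plain Kafka consumer (broker address, group id, and topic name are assumptions); a Camel route would ultimately need the equivalent settings passed through to the underlying consumer:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
    import io.confluent.kafka.serializers.KafkaAvroDeserializer;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PlainAvroConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
            props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);

            try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        }
    }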

Why isn't the AvroCoder deterministic?

Submitted by 折月煮酒 on 2019-12-24 10:48:29
Question: AvroCoder.isDeterministic returns false. Why isn't the AvroCoder deterministic? Wouldn't Avro records always be encoded into the same byte stream? Since the AvroCoder isn't deterministic, an Avro record can't be used as a key for a group-by operation. What's the best way to turn an Avro record into a key? Should we just use the JSON representation of the Avro record? Answer 1: Based on the Avro specification, it looks like only arrays and maps have a non-deterministic binary encoding. Maps look like …
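
In the Apache Beam flavor of the SDK (the question appears to use the older Dataflow SDK, which exposed isDeterministic() instead), the same information is surfaced by verifyDeterministic(), whose exception message explains which part of the schema breaks determinism. A sketch with an assumed schema containing a map field:

    import org.apache.avro.Schema;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.coders.Coder;

    public class CoderDeterminismCheck {
        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"K\",\"fields\":[" +
                "{\"name\":\"id\",\"type\":\"string\"}," +
                "{\"name\":\"tags\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");
            AvroCoder<?> coder = AvroCoder.of(schema);
            try {
                // Throws for schemas containing maps (and some other shapes) because
                // their binary encoding is not guaranteed to be byte-for-byte stable.
                coder.verifyDeterministic();
                System.out.println("Coder is deterministic");
            } catch (Coder.NonDeterministicException e) {
                System.out.println("Not deterministic: " + e.getMessage());
            }
        }
    }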