avro

Avro schema field is a timestamp but arrives in BigQuery as an integer

匆匆过客 submitted on 2020-01-25 03:10:08
Question: I have a pipeline that uploads Avro files to BigQuery. The configured schema seems to be OK, but BigQuery interprets the field as an integer value rather than a date field. What can I do in this case? Avro schema, date field: { "name": "date", "type": { "type": "long", "logicalType": "timestamp-millis" }, "doc": "the date when the transaction happened" } BigQuery table: I tried using the code below but it simply ignores it. Do you know the reason? import gcloud from gcloud import storage from google.cloud
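
A common cause is that the load job is not told to honour Avro logical types, in which case a long with logicalType timestamp-millis lands as INTEGER. Below is a minimal sketch of a load that sets that flag; the bucket, dataset, and table names are placeholders, not taken from the question.

```python
# Minimal sketch: load an Avro file into BigQuery while honouring logical types,
# so a long with logicalType timestamp-millis becomes TIMESTAMP rather than INTEGER.
# The URI and table reference below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,  # without this, the timestamp arrives as an integer
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/transactions/*.avro",   # placeholder source URI
    "my_project.my_dataset.transactions",   # placeholder destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

Note that if the destination table already exists with the column typed as INTEGER, the flag alone will not retype it; the column itself has to be a TIMESTAMP.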

Is there any way to compare two Avro files to see what differences exist in the data?

廉价感情. submitted on 2020-01-23 03:55:26
Question: Ideally, I'd like something packaged like SAS proc compare that can give me: the count of rows for each dataset; the count of rows that exist in one dataset but not the other; variables that exist in one dataset but not the other; variables that do not have the same format in the two files (I realize this would be rare for Avro files, but it would be helpful to know quickly without deciphering errors); the total number of mismatching rows for each column, and a presentation of all the mismatches
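
Nothing packaged like proc compare ships with Avro, but most of those checks can be scripted. Below is a rough sketch using fastavro; the file paths, and the assumption that rows can be compared positionally (same ordering in both files), are mine rather than part of the question.

```python
# Rough comparison of two Avro files: row counts, fields present in only one file,
# and per-column mismatch counts. Paths and positional row matching are assumptions.
from fastavro import reader

def load(path):
    with open(path, "rb") as fh:
        return list(reader(fh))

left = load("old.avro")    # placeholder path
right = load("new.avro")   # placeholder path

print("rows in left :", len(left))
print("rows in right:", len(right))

left_fields = set(left[0]) if left else set()
right_fields = set(right[0]) if right else set()
print("fields only in left :", left_fields - right_fields)
print("fields only in right:", right_fields - left_fields)

# Count mismatching rows per shared column, comparing records pairwise in order.
mismatches = {field: 0 for field in left_fields & right_fields}
for l_rec, r_rec in zip(left, right):
    for field in mismatches:
        if l_rec.get(field) != r_rec.get(field):
            mismatches[field] += 1
print("mismatching rows per column:", {f: n for f, n in mismatches.items() if n})
```

For large files, or when rows are not guaranteed to be in the same order, you would join on a key column first rather than zipping the two lists.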

Dynamically create Hive external table with Avro schema on Parquet Data

孤人 submitted on 2020-01-19 15:39:10
Question: I'm trying to dynamically (without listing column names and types in the Hive DDL) create a Hive external table on Parquet data files. I have the Avro schema of the underlying Parquet file. My attempt uses the DDL below: CREATE EXTERNAL TABLE parquet_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath' TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc'); My Hive table is successfully created with the right schema, but
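
One workaround that is often suggested for this situation (sketched here with caveats, since support for CREATE TABLE ... LIKE combined with STORED AS varies by Hive version) is to let Hive derive the columns from the .avsc through a throwaway Avro-backed helper table, then clone that table's layout as a Parquet table pointing at the data. The PyHive connection details and table names below are illustrative only.

```python
# Sketch of a commonly suggested workaround: materialise the columns from the Avro
# schema in a helper table, then clone its layout into a Parquet-backed external table.
# Host, port, and table names are placeholders; verify LIKE ... STORED AS on your Hive version.
from pyhive import hive

conn = hive.Connection(host="my-hive-host", port=10000)  # placeholder connection
cursor = conn.cursor()

# Helper table whose columns come from the Avro schema URL.
cursor.execute("""
    CREATE TABLE avro_schema_helper
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc')
""")

# Parquet table that copies the helper's columns but reads the existing Parquet files.
cursor.execute("""
    CREATE EXTERNAL TABLE parquet_test
    LIKE avro_schema_helper
    STORED AS PARQUET
    LOCATION 'hdfs://myParquetFilesPath'
""")
```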

How to decode/deserialize Kafka Avro strings with Python

本秂侑毒 submitted on 2020-01-14 03:30:08
Question: I am receiving Kafka Avro messages from a remote server in Python (using the consumer of the Confluent Kafka Python library); they represent clickstream data as JSON dictionaries with fields like user agent, location, URL, etc. Here is what a message looks like: b'\x01\x00\x00\xde\x9e\xa8\xd5\x8fW\xec\x9a\xa8\xd5\x8fW\x1axxx.xxx.xxx.xxx\x02:https://website.in/rooms/\x02Hhttps://website.in/wellness-spa/\x02\xaa\x14\x02\x9c\n\x02\xaa\x14\x02\xd0\x0b\x02V0:j3lcu1if:rTftGozmxSPo96dz1kGH2hvd0CREXmf2
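
If the messages were produced with the Confluent Avro serializer (wire format: a magic byte, a 4-byte schema id, then the Avro payload), the simplest route is to let the client decode them against the Schema Registry instead of parsing bytes by hand. A minimal sketch using the library's AvroConsumer follows; the broker address, registry URL, group id, topic, and field names are placeholders.

```python
# Minimal sketch: consume Confluent-serialized Avro messages and receive them as
# Python dicts. Assumes the producer used the Schema Registry wire format.
# Broker, registry URL, group id, topic, and field names are placeholders.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "clickstream-reader",
    "schema.registry.url": "http://schema-registry:8081",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clickstream"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = msg.value()  # AvroConsumer has already deserialized this into a dict
        print(record.get("url"), record.get("user_agent"))
finally:
    consumer.close()
```

If the payload does not start with a zero magic byte, it was not produced in that wire format, and the writer's schema has to be obtained some other way before avro or fastavro can decode it.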

How to populate the cache in CachedSchemaRegistryClient without making a call to register a new schema?

扶醉桌前 submitted on 2020-01-13 11:58:13
Question: We have a Spark Streaming application that integrates with Kafka. I'm trying to optimize it because it makes excessive calls to the Schema Registry to download schemas. The Avro schema for our data rarely changes, yet our application currently calls the Schema Registry whenever a record comes in, which is far too often. I ran into CachedSchemaRegistryClient from Confluent, and it looked promising, though after looking into its implementation I'm not sure how to use its built-in cache to reduce the
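
Whatever client class ends up being used, the shape of the optimization is the same: fetch each schema once per distinct id and keep it in memory for the lifetime of the application. Below is a small, client-agnostic sketch of that memoization; fetch_schema is a hypothetical helper standing in for whatever registry call the application already makes, not part of CachedSchemaRegistryClient's actual API.

```python
# Client-agnostic sketch: hit the Schema Registry at most once per distinct schema id.
# fetch_schema() is a hypothetical stand-in for the real registry lookup.
from functools import lru_cache

def fetch_schema(schema_id: int) -> str:
    # Replace with the real call, e.g. an HTTP GET to /schemas/ids/<id>
    # or whatever method the registry client in use exposes.
    raise NotImplementedError("wire up the real Schema Registry client here")

@lru_cache(maxsize=128)
def get_schema_cached(schema_id: int) -> str:
    # Only a cache miss reaches the registry; repeated ids are served from memory,
    # which removes the per-record registry call described above.
    return fetch_schema(schema_id)
```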

Apache NiFi - Extract Attributes From Avro

纵饮孤独 submitted on 2020-01-13 09:43:09
Question: I'm trying to get my head around extracting attributes from Avro and JSON. I'm able to extract attributes from JSON by using the EvaluateJsonPath processor. I'm trying to do the same with Avro, but I'm not sure whether it is achievable. Here is my flow: ExecuteSQL -> SplitAvro -> UpdateAttribute. UpdateAttribute is the processor where I want to extract the attributes. Please find below a snapshot of the UpdateAttribute processor. So, my basic question is: can we extract attributes from Avro? If yes,
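
Within NiFi, a common suggestion is to convert the content to JSON first (ConvertAvroToJSON followed by EvaluateJsonPath), since UpdateAttribute works only on attributes and cannot look inside Avro content on its own. Purely to illustrate what a split Avro flowfile contains, the fastavro sketch below reads one record and prints the field/value pairs one would promote to attributes; the file name is a placeholder and this is not NiFi API.

```python
# Illustration only: what one record of an Avro flowfile looks like once parsed.
# Each key/value pair is the kind of thing that would be promoted to a flowfile attribute.
# The file name is a placeholder.
from fastavro import reader

with open("split_record.avro", "rb") as fh:
    for record in reader(fh):
        for field_name, value in record.items():
            print(f"{field_name} = {value}")
        break  # SplitAvro at one record per flowfile means a single record suffices
```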

Backward Compatibility issue and uncertainty in Schema Registry

我只是一个虾纸丫 submitted on 2020-01-11 11:26:13
Question: I have a use case where I have a JSON document, and I want to generate a schema and a record out of the JSON and publish the record. I have configured the value serializer, and the schema compatibility setting is BACKWARD. First JSON: String json = "{\n" + " \"id\": 1,\n" + " \"name\": \"Headphones\",\n" + " \"price\": 1250.0,\n" + " \"tags\": [\"home\", \"green\"]\n" + "}\n" ; Version 1 of the schema was registered, and the message was received in the Avro console consumer. Second JSON: String json = "{\n" + " \"id\": 1,\n" + " \"price\":
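
Whether the registry will accept the second schema can be checked up front against the latest registered version using the Schema Registry's REST compatibility endpoint, before any produce attempt. A minimal sketch with requests follows; the registry URL, subject name, and the exact candidate schema are assumptions (the question's second JSON is truncated).

```python
# Ask the Schema Registry whether a candidate schema is compatible with the latest
# version registered under a subject. Registry URL, subject, and schema are placeholders.
import json
import requests

candidate_schema = {
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "price", "type": "double"},
        # fields dropped relative to version 1 are allowed under BACKWARD compatibility
    ],
}

resp = requests.post(
    "http://schema-registry:8081/compatibility/subjects/products-value/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(candidate_schema)}),
)
resp.raise_for_status()
print(resp.json())  # e.g. {"is_compatible": true}
```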

Data validation in AVRO

痞子三分冷 submitted on 2020-01-11 05:26:07
Question: I am new to Avro, so please excuse me if this is a simple question. I have a use case where I am using an Avro schema for record calls. Let's say I have the Avro schema { "name": "abc", "namespace": "xyz", "type": "record", "fields": [ {"name": "CustId", "type":"string"}, {"name": "SessionId", "type":"string"} ] } Now if the input is like { "CustId" : "abc1234", "SessionId" : "000-0000-00000" } I want to use some regex validations for these fields, and I want to accept this input only if it comes in
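
Avro's schema language has no regex constraint, so the usual approach is to validate the decoded record in application code and accept or reject it there. A minimal sketch follows; the regex patterns are placeholders inferred from the sample values, and the use of fastavro for the type check is an assumption.

```python
# Validate a decoded record: fastavro checks conformance to the Avro schema, then
# per-field regexes enforce the stricter formats. The patterns are placeholders.
import re
from fastavro import parse_schema
from fastavro.validation import validate

SCHEMA = parse_schema({
    "name": "abc",
    "namespace": "xyz",
    "type": "record",
    "fields": [
        {"name": "CustId", "type": "string"},
        {"name": "SessionId", "type": "string"},
    ],
})

PATTERNS = {
    "CustId": re.compile(r"^[a-z]{3}\d{4}$"),         # e.g. abc1234 (placeholder rule)
    "SessionId": re.compile(r"^\d{3}-\d{4}-\d{5}$"),  # e.g. 000-0000-00000 (placeholder rule)
}

def accept(record: dict) -> bool:
    if not validate(record, SCHEMA, raise_errors=False):
        return False
    return all(PATTERNS[name].match(record.get(name, "")) for name in PATTERNS)

print(accept({"CustId": "abc1234", "SessionId": "000-0000-00000"}))  # True
```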