spark-avro

Trouble reading avro files in Jupyter notebook using pyspark

邮差的信 submitted on 2021-01-29 08:51:02
Question: I am trying to read an Avro file in a Jupyter notebook using pyspark. When I read the file I get an error. I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should be inserting the Avro package. Any suggestions would be great. This is an example of the code I am using to read the Avro file:

    df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")

This is the error I get:

    AnalysisException: 'Failed to find data source: com…
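The usual fix is to make the package available to the driver before the session starts, by Maven coordinates rather than a bare jar path. A minimal sketch, shown in Scala (pyspark's SparkSession.builder accepts the same config key); the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve spark-avro by Maven coordinates via spark.jars.packages.
// This has to be set before the session (and its JVM) is created.
val spark = SparkSession.builder()
  .appName("avro-example")
  .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
  .getOrCreate()

val dfAvroExample = spark.read
  .format("com.databricks.spark.avro")
  .load("example_file.avro")
```

Launching the notebook's underlying driver with --packages com.databricks:spark-avro_2.11:4.0.0 accomplishes the same thing.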

How to write spark dataframe in a single file in local system without using coalesce

↘锁芯ラ submitted on 2021-01-27 21:21:22
Question: I want to generate an Avro file from a pyspark dataframe; currently I am calling coalesce, as below:

    df = df.coalesce(1)
    df.write.format('avro').save('file:///mypath')

But this is now leading to memory issues, since all the data is fetched into memory before writing, and my data size is growing consistently every day. So I want to write the data partition by partition, so that it is written to disk in chunks and does not raise OOM issues. I found that toLocalIterator helps in achieving this…
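A sketch of the toLocalIterator route, assuming the Avro Java library is on the driver classpath and a hypothetical two-column schema (the real schema must mirror the DataFrame's columns, and df is the dataframe from the question). Memory use is bounded by the largest single partition rather than the whole dataset:

```scala
import java.io.File
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Hypothetical schema; replace with one matching the DataFrame's columns.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Example","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}]}""".stripMargin)

val writer = new DataFileWriter[GenericRecord](
  new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/mypath/example.avro"))

// toLocalIterator streams one partition at a time to the driver, so only
// a single partition's rows are in memory while appending to the file.
df.toLocalIterator().asScala.foreach { row =>
  val rec = new GenericData.Record(schema)
  rec.put("id", row.getAs[Long]("id"))
  rec.put("name", row.getAs[String]("name"))
  writer.append(rec)
}
writer.close()
```

Each partition still has to fit on the driver, so it pays to repartition into reasonably sized partitions first.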

How to use spark-avro package to read avro file from spark-shell?

拟墨画扇 submitted on 2020-02-02 02:11:28
Question: I'm trying to use the spark-avro package as described in the Apache Avro Data Source Guide. When I submit the following command:

    val df = spark.read.format("avro").load("~/foo.avro")

I get an error:

    java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util…
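This ServiceConfigurationError is typically a version mismatch: a spark-avro artifact built against a different Spark or Scala version than the running shell. A hedged sketch of the fix, assuming Spark 2.4.3 with Scala 2.12 (substitute the coordinates matching your own build):

```scala
// Start the shell with a spark-avro build that matches the running Spark:
//   spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3

// Note that Spark does not expand "~"; pass an absolute path instead.
val df = spark.read.format("avro").load("/absolute/path/to/foo.avro")
df.printSchema()
```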

spark sql error when reading data from Avro Table

若如初见. submitted on 2019-12-25 00:13:32
Question: When I try to read data from an Avro table using spark-sql, I get this error:

    Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker…
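The truncated trace makes the root cause uncertain; one commonly reported trigger is table metadata whose Avro schema (avro.schema.literal or avro.schema.url) the SerDe cannot resolve. As a hedged workaround sketch, the table's underlying files can be read directly, bypassing the Hive SerDe entirely (path and package coordinates are assumptions):

```scala
// Workaround sketch: skip the Hive Avro SerDe and read the files directly.
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/user/hive/warehouse/mydb.db/my_avro_table")

df.createOrReplaceTempView("my_avro_table")
spark.sql("SELECT * FROM my_avro_table LIMIT 10").show()
```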

avro json additional field

天涯浪子 submitted on 2019-12-24 15:19:42
Question: I have the following Avro schema:

    {
      "type": "record",
      "name": "test",
      "namespace": "test.name",
      "fields": [
        {"name": "items", "type": {
          "type": "array",
          "items": {
            "type": "record",
            "name": "items",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "state", "type": "string"}
            ]
          }
        }},
        {"name": "firstname", "type": "string"}
      ]
    }

When I use a JSON decoder and an Avro encoder to encode JSON data:

    val writer = new GenericDatumWriter[GenericRecord](schema)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val…
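A self-contained sketch of that round trip, assuming the schema above is held in a string schemaJson and the input document in jsonInput: the JSON is decoded against the schema into a GenericRecord and then re-encoded as Avro binary. Whether an additional field in the JSON input is tolerated or rejected depends on the Avro version's JsonDecoder, which is usually the crux of questions like this one:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val schema = new Schema.Parser().parse(schemaJson) // schema string from above

// JSON -> GenericRecord, validated against the schema
val reader = new GenericDatumReader[GenericRecord](schema)
val jsonDecoder = DecoderFactory.get.jsonDecoder(schema, jsonInput)
val record = reader.read(null, jsonDecoder)

// GenericRecord -> Avro binary
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(record, encoder)
encoder.flush()
val avroBytes = out.toByteArray
```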

How to query datasets in avro format?

微笑、不失礼 submitted on 2019-12-12 09:46:26
Question: This works with Parquet:

    val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")

I tried the same with Avro, but it keeps giving me an error, even if I use com.databricks.spark.avro. When I execute the following query:

    val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")

I get an AnalysisException. Why?

    org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
        at org.apache.spark.sql.catalyst.analysis.package…
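The short name avro only resolves when a spark-avro module is on the classpath. A hedged sketch of two common ways around the exception, assuming the shell was started with --packages com.databricks:spark-avro_2.11:4.0.0 (the backticks around the fully qualified name in the first form are an assumption: the provider before the path has to parse as a single identifier):

```scala
// 1) Qualify the data source fully in the table identifier.
val sqlDF = spark.sql(
  "SELECT DISTINCT Source_Product_Classification " +
    "FROM `com.databricks.spark.avro`.`file path`")

// 2) Or sidestep SQL-over-files: load via the reader API, register a view.
val df = spark.read.format("com.databricks.spark.avro").load("file path")
df.createOrReplaceTempView("avro_table")
val sqlDF2 = spark.sql(
  "SELECT DISTINCT Source_Product_Classification FROM avro_table")
```

On Spark 2.4+ the built-in org.apache.spark:spark-avro module registers the avro short name, and the original query works as written once that module is loaded.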

Handling schema changes in running Spark Streaming application

你离开我真会死。 submitted on 2019-11-30 09:10:13
Question: I am looking to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far down the rabbit hole, I was hoping someone could help me understand how DataFrames deal with data having a different schema. The idea is that messages will flow into Kafka with an Avro schema. We should be able to evolve the schema in backwards-compatible ways without having to restart the streaming application (the application logic will still work). It appears trivial to deserialize new versions of messages using a schema registry and the schema id embedded in the message using…
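For the deserialization half, Avro schema resolution is the standard mechanism: each payload is decoded with the writer schema it was produced under (fetched from the registry by the embedded schema id) and projected onto the reader schema the application expects. A minimal sketch, all names assumed:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// writerSchema: looked up in the schema registry via the message's schema id.
// readerSchema: the version this application was built against. Avro resolves
// backwards-compatible differences (e.g. added fields with defaults) between
// the two.
def deserialize(payload: Array[Byte],
                writerSchema: Schema,
                readerSchema: Schema): GenericRecord = {
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get.binaryDecoder(payload, null)
  reader.read(null, decoder)
}
```

This keeps individual records readable across versions; the harder part of the question, evolving the DataFrame's schema while the application keeps running, is not solved by deserialization alone, since a DataFrame's schema is fixed when it is created.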

How to convert nested avro GenericRecord to Row

对着背影说爱祢 submitted on 2019-11-28 04:26:53
Question: I have code that converts my Avro record to a Row using the function avroToRowConverter():

    directKafkaStream.foreachRDD(rdd -> {
        JavaRDD<Row> newRDD = rdd.map(x -> {
            Injection<GenericRecord, byte[]> recordInjection =
                GenericAvroCodecs.toBinary(SchemaRegstryClient.getLatestSchema("poc2"));
            return avroToRowConverter(recordInjection.invert(x._2).get());
        });

This function is not working for a nested schema (TYPE = UNION).

    private static Row avroToRowConverter(GenericRecord avroRecord) {
        if (null == avroRecord…
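A hedged sketch of a recursive variant, in Scala: descend into nested records and convert each to its own Row. By the time a record has been read, Avro has already resolved each UNION field to its concrete branch (possibly null), so matching on the runtime value covers union-typed fields:

```scala
import scala.collection.JavaConverters._
import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row

// Convert a GenericRecord to a Row, recursing into nested records.
// A UNION field's value is already the resolved branch, so the runtime
// match below handles it without inspecting the union's schema.
def avroToRow(record: GenericRecord): Row = {
  val values = record.getSchema.getFields.asScala.map { field =>
    record.get(field.name) match {
      case nested: GenericRecord => avroToRow(nested) // nested record -> nested Row
      case s: Utf8               => s.toString        // Avro string -> Java String
      case other                 => other             // primitives, null, etc.
    }
  }
  Row.fromSeq(values)
}
```

Arrays and maps would need the same element-wise treatment; this sketch only covers the nested-record and union cases the question raises.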