spark-avro

Trouble reading avro files in Jupyter notebook using pyspark

邮差的信 submitted on 2021-01-29 08:51:02
Question: I am trying to read an Avro file in a Jupyter notebook using pyspark. When I read the file I get an error. I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should be inserting the Avro package. Any suggestions would be great. This is an example of the code I am using to read the Avro file:

    df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")

This is the error I get:

    AnalysisException: 'Failed to find data source: com…
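The usual fix is to make the package available to the driver before the session starts, by Maven coordinates rather than a bare jar path. A minimal sketch, shown in Scala (pyspark's SparkSession.builder accepts the same config key); the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve spark-avro by Maven coordinates via spark.jars.packages.
// This has to be set before the session (and its JVM) is created.
val spark = SparkSession.builder()
  .appName("avro-example")
  .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
  .getOrCreate()

val dfAvroExample = spark.read
  .format("com.databricks.spark.avro")
  .load("example_file.avro")
```

Launching the notebook's underlying driver with --packages com.databricks:spark-avro_2.11:4.0.0 accomplishes the same thing.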

How to write spark dataframe in a single file in local system without using coalesce

↘锁芯ラ submitted on 2021-01-27 21:21:22
Question: I want to generate an Avro file from a pyspark dataframe; currently I am calling coalesce, as below:

    df = df.coalesce(1)
    df.write.format('avro').save('file:///mypath')

But this is now leading to memory issues, since all the data is fetched into memory before writing, and my data size is growing consistently every day. So I want to write the data partition by partition, so that it is written to disk in chunks and does not raise OOM issues. I found that toLocalIterator helps in achieving this…
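A sketch of the toLocalIterator route, assuming the Avro Java library is on the driver classpath and a hypothetical two-column schema (the real schema must mirror the DataFrame's columns, and df is the dataframe from the question). Memory use is bounded by the largest single partition rather than the whole dataset:

```scala
import java.io.File
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Hypothetical schema; replace with one matching the DataFrame's columns.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Example","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}]}""".stripMargin)

val writer = new DataFileWriter[GenericRecord](
  new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/mypath/example.avro"))

// toLocalIterator streams one partition at a time to the driver, so only
// a single partition's rows are in memory while appending to the file.
df.toLocalIterator().asScala.foreach { row =>
  val rec = new GenericData.Record(schema)
  rec.put("id", row.getAs[Long]("id"))
  rec.put("name", row.getAs[String]("name"))
  writer.append(rec)
}
writer.close()
```

Each partition still has to fit on the driver, so it pays to repartition into reasonably sized partitions first.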

How to use spark-avro package to read avro file from spark-shell?

拟墨画扇 submitted on 2020-02-02 02:11:28
Question: I'm trying to use the spark-avro package as described in the Apache Avro Data Source Guide. When I submit the following command:

    val df = spark.read.format("avro").load("~/foo.avro")

I get an error:

    java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util…
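This ServiceConfigurationError is typically a version mismatch: a spark-avro artifact built against a different Spark or Scala version than the running shell. A hedged sketch of the fix, assuming Spark 2.4.3 with Scala 2.12 (substitute the coordinates matching your own build):

```scala
// Start the shell with a spark-avro build that matches the running Spark:
//   spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3

// Note that Spark does not expand "~"; pass an absolute path instead.
val df = spark.read.format("avro").load("/absolute/path/to/foo.avro")
df.printSchema()
```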

spark sql error when reading data from Avro Table

若如初见. submitted on 2019-12-25 00:13:32
Question: When I try to read data from an Avro table using spark-sql, I get this error:

    Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
        at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker…
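The truncated trace makes the root cause uncertain; one commonly reported trigger is table metadata whose Avro schema (avro.schema.literal or avro.schema.url) the SerDe cannot resolve. As a hedged workaround sketch, the table's underlying files can be read directly, bypassing the Hive SerDe entirely (path and package coordinates are assumptions):

```scala
// Workaround sketch: skip the Hive Avro SerDe and read the files directly.
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/user/hive/warehouse/mydb.db/my_avro_table")

df.createOrReplaceTempView("my_avro_table")
spark.sql("SELECT * FROM my_avro_table LIMIT 10").show()
```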

avro json additional field

天涯浪子 submitted on 2019-12-24 15:19:42
Question: I have the following Avro schema:

    {
      "type": "record",
      "name": "test",
      "namespace": "test.name",
      "fields": [
        {"name": "items", "type": {
          "type": "array",
          "items": {
            "type": "record",
            "name": "items",
            "fields": [
              {"name": "name", "type": "string"},
              {"name": "state", "type": "string"}
            ]
          }
        }},
        {"name": "firstname", "type": "string"}
      ]
    }

When I use a JSON decoder and an Avro encoder to encode JSON data:

    val writer = new GenericDatumWriter[GenericRecord](schema)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val…
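A self-contained sketch of that round trip, assuming the schema above is held in a string schemaJson and the input document in jsonInput: the JSON is decoded against the schema into a GenericRecord and then re-encoded as Avro binary. Whether an additional field in the JSON input is tolerated or rejected depends on the Avro version's JsonDecoder, which is usually the crux of questions like this one:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val schema = new Schema.Parser().parse(schemaJson) // schema string from above

// JSON -> GenericRecord, validated against the schema
val reader = new GenericDatumReader[GenericRecord](schema)
val jsonDecoder = DecoderFactory.get.jsonDecoder(schema, jsonInput)
val record = reader.read(null, jsonDecoder)

// GenericRecord -> Avro binary
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
writer.write(record, encoder)
encoder.flush()
val avroBytes = out.toByteArray
```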

How to query datasets in avro format?

微笑、不失礼 submitted on 2019-12-12 09:46:26
Question: This works with Parquet:

    val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")

I tried the same with Avro, but it keeps giving me an error, even if I use com.databricks.spark.avro. When I execute the following query:

    val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")

I get an AnalysisException. Why?

    org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
        at org.apache.spark.sql.catalyst.analysis.package…
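The short name avro only resolves when a spark-avro module is on the classpath. A hedged sketch of two common ways around the exception, assuming the shell was started with --packages com.databricks:spark-avro_2.11:4.0.0 (the backticks around the fully qualified name in the first form are an assumption: the provider before the path has to parse as a single identifier):

```scala
// 1) Qualify the data source fully in the table identifier.
val sqlDF = spark.sql(
  "SELECT DISTINCT Source_Product_Classification " +
    "FROM `com.databricks.spark.avro`.`file path`")

// 2) Or sidestep SQL-over-files: load via the reader API, register a view.
val df = spark.read.format("com.databricks.spark.avro").load("file path")
df.createOrReplaceTempView("avro_table")
val sqlDF2 = spark.sql(
  "SELECT DISTINCT Source_Product_Classification FROM avro_table")
```

On Spark 2.4+ the built-in org.apache.spark:spark-avro module registers the avro short name, and the original query works as written once that module is loaded.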

Handling schema changes in running Spark Streaming application

你离开我真会死。 submitted on 2019-11-30 09:10:13
Question: I am looking to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far down the rabbit hole, I was hoping someone could help me understand how DataFrames deal with data having a different schema. The idea is that messages will flow into Kafka with an Avro schema. We should be able to evolve the schema in backwards-compatible ways without having to restart the streaming application (the application logic will still work). It appears trivial to deserialize new versions of messages using a schema registry and the schema id embedded in the message using…
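For the deserialization half, Avro schema resolution is the standard mechanism: each payload is decoded with the writer schema it was produced under (fetched from the registry by the embedded schema id) and projected onto the reader schema the application expects. A minimal sketch, all names assumed:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// writerSchema: looked up in the schema registry via the message's schema id.
// readerSchema: the version this application was built against. Avro resolves
// backwards-compatible differences (e.g. added fields with defaults) between
// the two.
def deserialize(payload: Array[Byte],
                writerSchema: Schema,
                readerSchema: Schema): GenericRecord = {
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get.binaryDecoder(payload, null)
  reader.read(null, decoder)
}
```

This keeps individual records readable across versions; the harder part of the question, evolving the DataFrame's schema while the application keeps running, is not solved by deserialization alone, since a DataFrame's schema is fixed when it is created.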

How to convert nested avro GenericRecord to Row

对着背影说爱祢 submitted on 2019-11-28 04:26:53
Question: I have code that converts my Avro record to a Row using the function avroToRowConverter():

    directKafkaStream.foreachRDD(rdd -> {
        JavaRDD<Row> newRDD = rdd.map(x -> {
            Injection<GenericRecord, byte[]> recordInjection =
                GenericAvroCodecs.toBinary(SchemaRegstryClient.getLatestSchema("poc2"));
            return avroToRowConverter(recordInjection.invert(x._2).get());
        });

This function is not working for a nested schema (TYPE = UNION).

    private static Row avroToRowConverter(GenericRecord avroRecord) {
        if (null == avroRecord…
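A hedged sketch of a recursive variant, in Scala: descend into nested records and convert each to its own Row. By the time a record has been read, Avro has already resolved each UNION field to its concrete branch (possibly null), so matching on the runtime value covers union-typed fields:

```scala
import scala.collection.JavaConverters._
import org.apache.avro.generic.GenericRecord
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row

// Convert a GenericRecord to a Row, recursing into nested records.
// A UNION field's value is already the resolved branch, so the runtime
// match below handles it without inspecting the union's schema.
def avroToRow(record: GenericRecord): Row = {
  val values = record.getSchema.getFields.asScala.map { field =>
    record.get(field.name) match {
      case nested: GenericRecord => avroToRow(nested) // nested record -> nested Row
      case s: Utf8               => s.toString        // Avro string -> Java String
      case other                 => other             // primitives, null, etc.
    }
  }
  Row.fromSeq(values)
}
```

Arrays and maps would need the same element-wise treatment; this sketch only covers the nested-record and union cases the question raises.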