How to query datasets in avro format?

折月煮酒 submitted on 2019-12-05 17:38:09

Spark SQL supports avro format through a separate spark-avro module.

The project describes itself as "A library for reading and writing Avro data from Spark SQL."

Please note that spark-avro is a separate module that is not included by default in Spark.

You should load the module using the --packages option of spark-shell or spark-submit, e.g.

$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0

See the "With spark-shell or spark-submit" section of the spark-avro documentation.

Jacek's answer works in general, but in my environment it did not work for obscure reasons: spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 hung for a long time without producing any result.

I solved this problem by using the --jars option with spark-shell.

Steps :

1) Go to the page hosting the jar and copy the jar's link address.

2) wget <copied link address>

3) spark-shell --jars <path where you downloaded the jar file>/spark-avro_2.11-4.0.0.jar


The Avro file was then read into a DataFrame, and I was able to print it.

In your case, once you have the DataFrame, you can run SQL queries against it in the usual way.
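As a rough illustration of reading Avro into a DataFrame and then querying it with SQL, here is a minimal sketch meant to be pasted into a spark-shell started with the spark-avro jar as in step 3. The path, the column names, and the view name are made up for the example, and the snippet writes its own tiny Avro dataset first so that it does not depend on any existing file.

```scala
// Assumes a spark-shell session (where `spark` is predefined) started with
// the Databricks spark-avro jar via --jars or --packages.
// The path, column names and view name below are hypothetical.
val path = "/tmp/users_avro_sketch"

// Write a tiny Avro dataset so the example has something to read.
val users = spark.createDataFrame(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
users.write.mode("overwrite").format("com.databricks.spark.avro").save(path)

// Read the Avro files back into a DataFrame and inspect the schema.
val df = spark.read.format("com.databricks.spark.avro").load(path)
df.printSchema()

// Register a temporary view so the data can be queried with plain SQL.
df.createOrReplaceTempView("users")
val adults = spark.sql("SELECT name FROM users WHERE age > 26")
adults.show()
```

For your own data, replace the write step with the location of your existing Avro files in the load call.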

Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that bundles spark-avro_2.11-4.0.0.jar, referencing it by its Maven coordinates.
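For sbt, the dependency might be declared roughly as follows. This is only a sketch: the group ID is taken from the --packages example earlier in the thread and the artifact and version from the jar name above, while the exact Scala patch version is an assumption. Building the uber jar itself would additionally need a plugin such as sbt-assembly, which is not shown.

```scala
// build.sbt fragment (sketch, not a complete build definition).
// Assumption: any Scala 2.11.x release, so that %% resolves to spark-avro_2.11.
scalaVersion := "2.11.12"

// Coordinates inferred from the jar name spark-avro_2.11-4.0.0.jar.
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
```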


Note: A built-in Avro data source was introduced in Spark 2.4. See SPARK-24768, "Have a built-in AVRO data source implementation".

This means that the separate Databricks package described above is no longer needed from Spark 2.4 onwards. See the Spark 2.4.0 release notes.
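With the built-in data source, the short format name "avro" can be used. One caveat, as far as the Spark documentation describes it: the built-in module is still distributed as an external artifact, so the sketch below assumes a Spark 2.4 spark-shell started with something like --packages org.apache.spark:spark-avro_2.11:2.4.0. The path is hypothetical.

```scala
// Assumes a Spark 2.4+ spark-shell (where `spark` is predefined) with the
// org.apache.spark spark-avro module on the classpath.
// The path below is hypothetical.
val builtinPath = "/tmp/avro_builtin_sketch"

// Write three rows (a single `id` column) using the short format name.
spark.range(3).write.mode("overwrite").format("avro").save(builtinPath)

// Read them back with the same short name.
val ids = spark.read.format("avro").load(builtinPath)
ids.show()
```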
