How to query datasets in avro format?

Submitted by 折月煮酒 on 2019-12-05 17:38:09

Spark SQL supports avro format through a separate spark-avro module.

A library for reading and writing Avro data from Spark SQL.

Please note that spark-avro is a separate module that is not included in Spark by default.

You should load the module using spark-submit --packages, e.g.

$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0

See the "With spark-shell or spark-submit" section of the spark-avro documentation.
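Once the module is on the classpath, reading Avro is a one-liner. A minimal sketch, assuming a spark-shell started with the `--packages` option above and a hypothetical sample file `episodes.avro`:

```scala
// Inside spark-shell (the `spark` session is predefined there).
// The external module registers the format "com.databricks.spark.avro".
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("episodes.avro") // hypothetical path to an Avro file

df.printSchema() // Avro schema is mapped to a Spark SQL schema
df.show()
```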

Jacek's answer works in general, but in my environment it did not work, for obscure reasons: `spark-shell --packages com.databricks:spark-avro_2.11:3.2.0` hung for a long time without producing any result.

I solved this problem by using the --jars option with spark-shell instead.

Steps :

1) Go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0 and copy the link address of the jar: http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar

2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar

3) spark-shell --jars <path where you downloaded the jar file>/spark-avro_2.11-4.0.0.jar

4) spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")

This loaded the Avro file into a DataFrame, which I was then able to print.

In your case, once you have the DataFrame, you can run SQL on it in the usual way.
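For instance, you can register the DataFrame as a temporary view and query it with plain SQL. A sketch, assuming `df` was loaded as in step 4 above (the view name `my_avro_table` is an arbitrary choice):

```scala
// Register the Avro-backed DataFrame under a name SQL can reference.
df.createOrReplaceTempView("my_avro_table")

// Any Spark SQL query now works against the Avro data.
spark.sql("SELECT count(*) FROM my_avro_table").show()
```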

Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that bundles spark-avro_2.11-4.0.0.jar, using the Maven coordinates below.

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>

Note: A built-in Avro data source was introduced in Spark 2.4 onwards (SPARK-24768: "Have a built-in AVRO data source implementation").

This means that none of the above is necessary any more. See the Spark 2.4.0 release notes.
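With Spark 2.4+, the short format name "avro" is available once the bundled spark-avro module is on the classpath (it still ships as a separate artifact, loaded e.g. via `spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0`). A sketch, reusing the hypothetical S3 path from step 4 above:

```scala
// Spark 2.4+: use the built-in short format name "avro" instead of
// the Databricks package name "com.databricks.spark.avro".
val df = spark.read.format("avro").load("s3://MYAVROLOCATION.avro")

df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t LIMIT 10").show()
```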
