I use HDP-2.6.3.0 with Spark2 package 2.2.0.
I'm trying to write a Kafka consumer using the Structured Streaming API, but I'm getting an error after submitting the application.
I faced the same error, and it took me a couple of days to figure out. When you copy the dependency from the Maven repository, in particular "spark-sql-kafka", it contains the line:
<scope>provided</scope>
The solution was to remove this line so that the dependency runs in the default "compile" scope. The same applies if you use SBT. It's probably worth removing it from other dependencies as well if they have it, just in case.
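As a rough sketch, the SBT equivalent is simply declaring the dependency without a "provided" qualifier (the 2.2.0 / Scala 2.11 coordinates mirror the Maven snippet in the next answer; adjust them to your own Spark build):
// build.sbt: default (compile) scope, i.e. no `% "provided"` at the end
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"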
The kafka data source is an external module and is not available to Spark applications by default.
You have to define it as a dependency in your pom.xml (as you have done), but that's just the very first step to have it in your Spark application.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
With that dependency in place, you have to decide whether you want to create a so-called uber-jar that bundles all the dependencies together (which results in a fairly big jar file and makes submission take longer), or use the --packages option (or the less flexible --jars option) to add the dependency at spark-submit time.
(There are other options like storing the required jars on Hadoop HDFS or using Hadoop distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple)
I'd recommend using --packages first, and only when it works consider the other options.
Use spark-submit --packages to include the spark-sql-kafka-0-10 module as follows.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Include the other command-line options as you wish.
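For context, a minimal Structured Streaming consumer that exercises the kafka source might look like the sketch below; the broker address and topic name are placeholders, not taken from the question. Without the module on the classpath, the load() call fails because Spark cannot resolve the kafka format.
import org.apache.spark.sql.SparkSession

object KafkaStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaStreamExample")
      .getOrCreate()

    // Requires the spark-sql-kafka-0-10 module on the classpath
    // (e.g. via --packages as shown above); otherwise format("kafka") cannot be resolved.
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "my-topic")                      // placeholder topic
      .load()

    // Print keys and values to the console as they arrive.
    val query = kafkaDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}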
Including all the dependencies in a so-called uber-jar may not always work due to how META-INF directories are handled.
For the kafka data source to work (as for data sources in general), you have to ensure that the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files of all the data sources are merged (not replace or first or whatever strategy you use).
The kafka data source uses its own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister that registers org.apache.spark.sql.kafka010.KafkaSourceProvider as the data source provider for the kafka format.
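For reference, the copy of that service file shipped in the spark-sql-kafka-0-10 jar essentially boils down to a single fully-qualified class name:
org.apache.spark.sql.kafka010.KafkaSourceProvider
If a replace or first strategy keeps only one jar's copy of the file, this entry can be silently dropped and Spark can no longer resolve the kafka format at runtime.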
The top answer is correct; this solved the issue for me:
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  case PathList("META-INF", xs@_*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
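If you go this uber-jar route with sbt-assembly, you then build the assembled jar and submit it directly instead of using --packages, roughly along these lines (placeholders, not exact paths):
sbt assembly
spark-submit --master local --class "<your_main_class>" <path_to_assembly_jar>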
My solution was different: I specify the spark-sql-kafka package directly on the spark-submit command:
.\bin\spark-submit --master local --class "org.myspark.KafkaStream" --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 <path_to_jar>
Related: http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I had a similar issue; it started when we upgraded the Cloudera Spark version from 2.2 to 2.3.
The issue was that my uber jar's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister was getting overwritten by the same file from some other jars, so Spark was not able to find the Kafka entry in the DataSourceRegister file.
Resolution: modifying the pom.xml helped me.
<configuration>
  <transformers>
    <transformer
        implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
      <resource>
        META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
      </resource>
    </transformer>
  </transformers>
</configuration>
For an uber-jar, adding the ServicesResourceTransformer to the maven-shade-plugin works for me.
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>