TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

Question

I am using PySpark Streaming to read data from Kafka, but the following script fails:

import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()

ssc.start()
ssc.awaitTermination()

________________________________________________________________________________________________

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the
 spark-submit command as

 $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
 Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
 Then, include the jar in the spark-submit command as

 $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

________________________________________________________________________________________________


Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable

My Spark version is 2.4.3 and my Kafka version is 2.1.0. I also replaced os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell', but that does not work either. How can I fix this?

Answer 1:

I think you should reorder your imports so that the environment variable is set before you import and initialize Spark.

You also definitely need to use a package version that matches your Spark version.

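import os
sparkVersion = '2.4.3'  # must match the installed Spark version
scalaVersion = '2.11'   # assumption: Scala version your Spark build uses; the Maven artifact name carries it as a suffix
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_{}:{} pyspark-shell'.format(scalaVersion, sparkVersion)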

# import Spark core 
from pyspark.sql import SparkSession 
from pyspark.streaming import StreamingContext
# import extra packages 
from pyspark.streaming.kafka import KafkaUtils


# begin application 
spark = SparkSession.builder.appName("test").getOrCreate() 
sc = spark.sparkContext
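
The version-matching advice can be checked mechanically before Spark starts. The sketch below is a hypothetical helper (the name check_kafka_package and its messages are not part of PySpark) that parses a PYSPARK_SUBMIT_ARGS string and reports a Kafka connector whose version differs from the Spark version, or whose coordinate has no Scala suffix. It assumes that the connector artifacts on Maven Central carry a Scala-version suffix such as spark-streaming-kafka-0-8_2.11, so a coordinate without one may fail to resolve; verify this against your own Spark build.

```python
import re
import shlex

# group:artifact[_scalaVersion]:version. The Scala suffix is optional in the
# pattern so that its absence can be reported instead of silently ignored.
COORD = re.compile(
    r"^(?P<group>[^:]+):(?P<artifact>[^:_]+)(?:_(?P<scala>\d+\.\d+))?:(?P<version>[^:]+)$"
)


def check_kafka_package(submit_args, spark_version):
    """Return a list of problems found in a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    tokens = shlex.split(submit_args)
    if "--packages" not in tokens:
        return ["no --packages option found"]
    idx = tokens.index("--packages")
    if idx + 1 >= len(tokens):
        return ["--packages has no value"]

    problems = []
    kafka_seen = False
    # --packages takes a comma-separated list of Maven coordinates
    for coord in tokens[idx + 1].split(","):
        match = COORD.match(coord)
        if match is None or "kafka" not in match.group("artifact"):
            continue
        kafka_seen = True
        if match.group("version") != spark_version:
            problems.append(
                "{}: version {} does not match Spark {}".format(
                    coord, match.group("version"), spark_version
                )
            )
        if match.group("scala") is None:
            problems.append("{}: no Scala suffix such as _2.11".format(coord))
    if not kafka_seen:
        problems.append("no Kafka connector package listed")
    return problems


# The value used in the question, checked against Spark 2.4.3
question_args = "--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell"
for problem in check_kafka_package(question_args, "2.4.3"):
    print(problem)
```

An empty list means the helper found nothing to complain about; it does not prove that the package will actually download or load.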

Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
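
Because of that deprecation, one possible alternative is the Structured Streaming Kafka source, which talks to the Kafka brokers directly instead of going through ZooKeeper. The following is only a sketch under assumptions that do not come from the question: the package coordinate org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3, a broker at localhost:9092, and the helper names kafka_source_options and read_kafka_lines, which are made up for illustration.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in "kafka" Structured Streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # broker address, not ZooKeeper
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def read_kafka_lines(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with each Kafka record value cast to a string."""
    options = kafka_source_options(bootstrap_servers, topic)
    stream = spark.readStream.format("kafka").options(**options).load()
    return stream.selectExpr("CAST(value AS STRING) AS value")


# Usage (needs a SparkSession started with the spark-sql-kafka package and a
# reachable broker; both are assumptions here):
#
#   lines = read_kafka_lines(spark, "localhost:9092", "test")
#   query = lines.writeStream.format("console").start()
#   query.awaitTermination()
```

The options are built in a separate function so they can be inspected without a running cluster.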

Source: https://stackoverflow.com/questions/59598135/typeerror-javapackage-object-is-not-callable-spark-streamings-kafka-librar

Tags

apache-spark

pyspark

apache-kafka

spark-streaming