Spark Structured Streaming with Confluent Cloud Kafka connectivity issue

Submitted by 落爺英雄遲暮 on 2021-02-04 16:41:16

Question


I am writing a Spark Structured Streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for Spark's readStream() is shallow and doesn't say much about the optional parameters, especially the auth mechanism part. I am not sure which parameter is wrong and breaks the connectivity. Can anyone with Spark experience help me get this connection started?

Required Parameters

Consumer({
    'bootstrap.servers': 'cluster.gcp.confluent.cloud:9092',
    'sasl.username': 'xxx',
    'sasl.password': 'xxx',
    'sasl.mechanisms': 'PLAIN',
    'security.protocol': 'SASL_SSL',
    'group.id': 'python_example_group_1',
    'auto.offset.reset': 'earliest'
})
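For reference, here is a minimal sketch of checking the same settings with the confluent-kafka Python client outside Spark (the cluster address, topic name, and credentials are placeholders):

from confluent_kafka import Consumer

# Same settings as above; cluster address, credentials, and topic are placeholders
consumer = Consumer({
    'bootstrap.servers': 'cluster.gcp.confluent.cloud:9092',
    'sasl.username': 'xxx',
    'sasl.password': 'xxx',
    'sasl.mechanisms': 'PLAIN',
    'security.protocol': 'SASL_SSL',
    'group.id': 'python_example_group_1',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['test-topic'])

msg = consumer.poll(10.0)  # wait up to 10 seconds for one record
if msg is not None and msg.error() is None:
    print(msg.value().decode('utf-8'))
consumer.close()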

Here is my PySpark code:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092") \
  .option("subscribe", "test-topic") \
  .option("kafka.sasl.mechanisms", "PLAIN")\
  .option("kafka.security.protocol", "SASL_SSL")\
  .option("kafka.sasl.username","xxx")\
  .option("kafka.sasl.password", "xxx")\
  .option("startingOffsets", "latest")\
  .option("kafka.group.id", "python_example_group_1")\
  .load()
display(df)

However, I keep getting an error:

kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer

Databricks notebook - for testing

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4673082066872014/3543014086288496/1802788104169533/latest.html

Documentation

https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-kafka-integration.html


Answer 1:


This error indicates that the JAAS configuration is not visible to your Kafka consumer. To solve the issue, supply the JAAS configuration using the following steps:

Step 1: Create a JAAS configuration file, e.g. /home/jass/path:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=true
    renewTicket=true
    serviceName="kafka";
};

Step 2: Pass that JAAS file path to spark-submit with the following conf parameter:

--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path"

Full spark-submit command:

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf spark.ui.port=4055 --files /home/jass/path,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" pysparkstructurestreaming.py

PySpark Structured Streaming sample code:

from pyspark.sql import SparkSession

#  Spark session :

spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()

#  Kafka Topic Details :

KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'

#  Creating readStream DataFrame :

df = spark.readStream \
     .format("kafka") \
     .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
     .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
     .option("startingOffsets", "earliest") \
     .option("kafka.security.protocol", "SASL_SSL") \
     .option("kafka.client.id", "Client_id") \
     .option("kafka.sasl.kerberos.service.name", "kafka") \
     .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
     .option("kafka.ssl.truststore.password", "password_rd") \
     .option("kafka.sasl.kerberos.keytab", "/home/path.keytab") \
     .option("kafka.sasl.kerberos.principal", "path") \
     .load()

# Kafka delivers the value as bytes; cast it to a string
df1 = df.selectExpr("CAST(value AS STRING)")

#  Creating writeStream query :

query = df1.writeStream \
   .option("path", "target_directory") \
   .format("csv") \
   .option("checkpointLocation", "chkpint_directory") \
   .outputMode("append") \
   .start()

# Block until the streaming query terminates
query.awaitTermination()



Answer 2:


We need to specify kafka.sasl.jaas.config to pass the username and password for the Confluent Cloud SASL_SSL auth method. Its value looks a bit odd, but it works.

df = spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
      .option("subscribe", "wallet_txn_log") \
      .option("startingOffsets", "earliest") \
      .option("kafka.security.protocol","SASL_SSL") \
      .option("kafka.sasl.mechanism", "PLAIN") \
      .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";""").load()
display(df)
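The kafkashaded. prefix matches the relocated Kafka classes in the Databricks runtime. On a stock Spark build with the spark-sql-kafka-0-10 package, the same option would normally reference the unshaded login module; a minimal sketch, with placeholder credentials:

# JAAS string for SASL/PLAIN on a non-Databricks Spark build; username/password are placeholders
jaas_config = (
    'org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="xxx" password="xxx";'
)

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
    .option("subscribe", "wallet_txn_log") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .load()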


Source: https://stackoverflow.com/questions/59402916/spark-structural-streaming-with-confluent-cloud-kafka-connectivity-issue
