A bad issue with Kafka and Spark Streaming on Python

Submitted by 戏子无情 on 2021-01-07 02:42:17

Question


N.B. This is NOT the same issue that I had in my first post on this site; however, it is the same project.

I'm ingesting files from Kafka into PostgreSQL using Spark Streaming. These are the steps of the project:

1- Creating a script for the Kafka producer (done, it works fine; a rough sketch follows after this list)

2- Creating a Python script that reads the files from the Kafka producer

3- Sending the files to PostgreSQL
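
For reference, here is a rough sketch of such a producer. This assumes kafka-python, and the broker address and file name are placeholders (my actual script differs):

from kafka import KafkaProducer

# Minimal sketch: send one Kafka message per line of the input file.
# The broker address, topic and file name below are placeholders.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("bogi2890.20n", "rb") as f:  # hypothetical input file
    for line in f:
        producer.send("bogi2890.20n", value=line)  # topic name taken from the error output below

producer.flush()
producer.close()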

For the connection between Python and PostgreSQL I use psycopg2. I am also using Python 3 and Java JDK 1.8.0_261, and the integration between Kafka and Spark Streaming works fine. I have Kafka 2.12-2.6.0 and Spark 3.0.1, and I added these jars to my Spark jars directory:

  • postgresql-42.2.18
  • spark-streaming-kafka-0-10-assembly_2.12-3.0.1
  • spark-token-provider-kafka-0-10_2.12-3.0.1
  • kafka-clients-2.6.0
  • spark-sql-kafka-0-10-assembly_2.12-3.0.1

I also had to download VC++ to fix another issue related to this project.
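
For context, satelliteTable (used below) is created roughly as follows. The broker address and the value parsing are placeholders; only the topic name matches the offsets shown in the error output further down:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SatelliteIngest").getOrCreate()

# Minimal sketch: subscribe to the topic as a streaming DataFrame.
satelliteTable = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker address
    .option("subscribe", "bogi2890.20n")
    .load()
    .selectExpr("CAST(value AS STRING)")  # the real parsing into the 20 satellite columns is omitted
)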

This is the part of the Python code that takes the files from the Kafka producer and sends them into a PostgreSQL table that I created; it is where I have problems:

query = satelliteTable.writeStream.outputMode("append").foreachBatch(process_row) \
    .option("checkpointLocation", "C:\\Users\\Vito\\Documents\\popo").start()
print("Starting")
print(query)
query.awaitTermination()
query.stop()

satelliteTable is the Spark streaming DataFrame that I created from the files coming from the Kafka producer. process_row is the function that inserts each row of the streaming DataFrame into the PostgreSQL table. Here it is:

def process_row(df, epoch_id):
    # Insert every row of this micro-batch into the satellite table.
    for row in df.rdd.collect():
        cursor1.execute(
            'INSERT INTO satellite(filename, satellite_prn_number, date, time, crs, delta_n, m0, '
            'cuc, e_eccentricity, cus, sqrt_a, toe_time_of_ephemeris, cic, omega_maiusc, cis, '
            'i0, crc, omega, omega_dot, idot) '
            'VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, '
            '%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
            row)
    connection.commit()
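
The snippet above assumes that connection and cursor1 were created earlier with psycopg2. A minimal sketch of that setup, with placeholder credentials and database name:

import psycopg2

# Placeholder connection parameters; substitute your own.
connection = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="satellites",  # hypothetical database name
    user="postgres",
    password="secret",
)
cursor1 = connection.cursor()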

The issue I get when I run my code happens at the query = satelliteTable.writeStream...start() line shown above, and in short it is the following:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, DESKTOP-D600TY.homenet.telecomitalia.it, executor driver): java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig

=== Streaming Query ===
Identifier: [id = 599f75a7-5db6-426e-9082-7fbbf5196db9, runId = 67693586-27b1-4ca7-9a44-0f69ad90eafe]
Current Committed Offsets: {}
Current Available Offsets: {KafkaV2[Subscribe[bogi2890.20n]]: {"bogi2890.20n":{"0":68}}}

Current State: ACTIVE
Thread State: RUNNABLE

The odd thing is that the same code runs fine on my friend's laptop with Spark 3.0.0. So I think I am missing some jars or something else, because the code itself is correct.

Any idea? Thanks.


Answer 1:


You are missing this jar: https://mvnrepository.com/artifact/org.apache.commons/commons-pool2. Try that specific version: https://mvnrepository.com/artifact/org.apache.commons/commons-pool2/2.6.2
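
If copying jars into the Spark directory keeps causing mismatches, you can instead let Spark resolve the dependencies from Maven at startup. A sketch using the spark.jars.packages config, with coordinates matching the Spark 3.0.1 / Scala 2.12 setup from the question (commons-pool2 is pinned explicitly to the version suggested above):

from pyspark.sql import SparkSession

# spark-sql-kafka-0-10 pulls in kafka-clients and the token provider transitively.
spark = (
    SparkSession.builder
    .appName("SatelliteIngest")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,"
        "org.apache.commons:commons-pool2:2.6.2",
    )
    .getOrCreate()
)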



Source: https://stackoverflow.com/questions/64594199/a-bad-issue-with-kafka-and-spark-streaming-on-python
