Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

我的梦境 submitted on 2021-01-28 01:42:06


I am using the code below to read a CSV file from S3 on my local machine.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import configparser
import os

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')

# Tried setting these, but the read still failed
conf.set('spark.executor.memory', '8g') 
conf.set('spark.driver.memory', '8g') 

spark_session = SparkSession.builder \
        .config(conf=conf) \
        .appName('s3-write') \
        .getOrCreate()

# getting S3 credentials from file
aws_profile = "lijo" #user profile name
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get(aws_profile, "aws_access_key_id") 
secret_key = config.get(aws_profile, "aws_secret_access_key")

# hadoop configuration for S3
hadoop_conf = spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Tried setting these as well, but they made no difference
hadoop_conf.set("fs.s3a.connection.maximum", "1000") 
hadoop_conf.set("fs.s3.maxConnections", "1000") 
hadoop_conf.set("fs.s3a.connection.establish.timeout", "50000") 
hadoop_conf.set("fs.s3a.socket.recv.buffer", "8192000") 
hadoop_conf.set("fs.s3a.readahead.range", "32M")

# 1) Read csv
df = spark_session.read.csv("s3a://pyspark-lijo-test/auction.csv", header=True, mode="DROPMALFORMED")
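
For reference, the credential lookup in the code above could be pulled into a small helper that fails loudly when the file or profile is missing. This is only a sketch: the function name load_aws_keys and its error handling are my own assumptions, not part of the original code. It relies only on the standard-library configparser module.

```python
import configparser
import os


def load_aws_keys(profile, path="~/.aws/credentials"):
    """Return (access_key, secret_key) for one profile in an AWS credentials file.

    Hypothetical helper: the name and the error handling are illustrative only.
    """
    config = configparser.ConfigParser()
    # config.read() returns the list of files it managed to parse,
    # so an empty list means the file was missing or unreadable.
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError(f"could not read credentials file: {path}")
    if not config.has_section(profile):
        raise KeyError(f"profile not found: {profile}")
    return (config.get(profile, "aws_access_key_id"),
            config.get(profile, "aws_secret_access_key"))
```

With this in place, the two config.get calls would become access_key, secret_key = load_aws_keys(aws_profile).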

Below are my Spark standalone configuration details.

[('', ''),
 ('', 'driver'),
 ('', 's3-write'),
 ('', 'local-1594186616260'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.memory', '8g'),
 ('spark.driver.port', '35497'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.memory', '8g'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true')]

However, I get the error below even when reading a 1 MB file.

Py4JJavaError: An error occurred while calling o43.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0,, executor driver): org.apache.spark.util.TaskCompletionListenerException: Premature end of Content-Length delimited message body (expected: 888,879; received: 16,360)
    at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:145)
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.base/java.util.concurrent.ThreadPoolExecutor$
    at java.base/

I tried changing the S3 read code to the version below and it works, but then we need to convert the RDD to a DataFrame.

2) data = spark_session.sparkContext.textFile("s3a://pyspark-lijo-test/auction.csv").map(lambda line: line.split(","))
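
If the RDD route has to be used, the conversion to a DataFrame might look roughly like the sketch below. It assumes the first line of the file is a header row and that every row has the same number of fields. The helper names parse_csv_line and rdd_to_dataframe are hypothetical. The sketch also swaps line.split(",") for the standard csv module so that quoted commas are handled, and all resulting columns are strings.

```python
import csv


def parse_csv_line(line):
    """Split one CSV line into fields, keeping commas inside quotes intact."""
    return next(csv.reader([line]))


def rdd_to_dataframe(spark_session, lines_rdd):
    """Turn an RDD of raw CSV lines into a DataFrame of string columns.

    Assumption: the first line of the file is the header row.
    """
    header_line = lines_rdd.first()
    column_names = parse_csv_line(header_line)
    rows = (lines_rdd
            .filter(lambda line: line != header_line)  # drop the header row
            .map(parse_csv_line))
    # createDataFrame accepts a list of column names as the schema
    return spark_session.createDataFrame(rows, schema=column_names)
```

It would be called as df = rdd_to_dataframe(spark_session, spark_session.sparkContext.textFile("s3a://pyspark-lijo-test/auction.csv")).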

Why can't the Spark SQL code (1) read even a small file? Is there a setting that needs to be changed?


Found the issue: it was a problem with Spark 3.0. After switching to Spark 2.4.6, the read works as expected.

