spark.sql.files.maxPartitionBytes not limiting max size of written partitions


Question


I'm trying to copy Parquet data from another S3 bucket to my S3 bucket, and I want to limit the size of each written partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy I see individual files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key to limit the max size of my partitions: Limiting maximum size of dataframe partition. But it doesn't seem to work.

This is the definition of that config key:

The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

I'm also a bit confused about how this relates to the size of the written Parquet files.

For reference, I am running a Glue script on Glue version 1.0 with Spark 2.4, and the script is this:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode

val conf: SparkConf = new SparkConf()
conf.set("spark.sql.catalogImplementation", "hive")
    .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId)
val spark: SparkContext = new SparkContext(conf)

val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession

val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")
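
For reference, a sketch of how the key can be set (the value shown is the 128 MB default; either form should work, and the exact line is not in the script above):

// On the SparkConf before the context is created:
conf.set("spark.sql.files.maxPartitionBytes", "134217728")   // 128 MB in bytes

// Or at runtime on an existing SparkSession:
sparkSession.conf.set("spark.sql.files.maxPartitionBytes", "134217728")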

Answer 1:


The setting spark.sql.files.maxPartitionBytes does indeed affect the maximum size of the partitions when reading the data into the Spark cluster. If the output files end up too large, I suggest decreasing the value of this setting; it should create more files, because the input data will be distributed among more partitions. This will not hold, however, if your query contains a shuffle, because then the data is always repartitioned into the number of partitions given by the spark.sql.shuffle.partitions setting.
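
As a sketch of that advice (the values below are illustrative, and sparkSession is the session from the question):

sparkSession.conf.set("spark.sql.files.maxPartitionBytes", "67108864") // ~64 MB read partitions instead of the 128 MB default
sparkSession.conf.set("spark.sql.shuffle.partitions", "400")           // governs the partition count instead if the query shuffles

val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")   // one output file per final partition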

Also, the final size of your files will depend on the file format and compression that you use. So if you output the data to, for example, Parquet, the files will be much smaller than if you output to CSV or JSON.
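
For example (the paths and codec choices are just illustrative), writing the same DataFrame in different formats:

// Columnar and compressed: typically the smallest on disk
sqlDF.write.mode(SaveMode.Overwrite).option("compression", "snappy").parquet("s3://my-s3-location/parquet/")

// Plain-text CSV of the same data is usually much larger, even with compression enabled
sqlDF.write.mode(SaveMode.Overwrite).option("compression", "gzip").csv("s3://my-s3-location/csv/")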



Source: https://stackoverflow.com/questions/62648621/spark-sql-files-maxpartitionbytes-not-limiting-max-size-of-written-partitions
