Spark parquet partitioning : Large number of files


First, I would really avoid using coalesce, as this is often pushed further up the transformation chain and may destroy the parallelism of your job (I asked about this issue here: How to prevent Spark optimization).

Writing one file per parquet partition is relatively easy (see Spark dataframe write method writing many small files):

data.repartition($"key").write.partitionBy("key").parquet("/location")

If you want to set an arbitrary number of files (or files which all have the same size), you need to further repartition your data using another attribute (I cannot tell you what this might be in your case):

data.repartition($"key",$"another_key").write.partitionBy("key").parquet("/location")

another_key could be another attribute of your dataset, or an attribute derived from existing ones using modulo or rounding operations. You could even use a window function with row_number over key and then round the result, with something like

data.repartition($"key",floor($"row_number"/N)*N).write.partitionBy("key").parquet("/location")

This would put N records into each parquet file.
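For reference, here is a minimal sketch of that window-function variant; the chunk size N, the helper column names rn and chunk, and the ordering column are assumptions for illustration (spark.implicits._ is assumed to be in scope for the $ syntax):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{floor, row_number}

val N = 10000 // assumed target number of records per file

// Number the rows within each key (the ordering column is a placeholder),
// then group every N consecutive rows into the same repartitioning bucket.
val withChunk = data
  .withColumn("rn", row_number().over(Window.partitionBy($"key").orderBy($"some_ordering_column")))
  .withColumn("chunk", floor($"rn" / N))

withChunk
  .repartition($"key", $"chunk")
  .drop("rn", "chunk") // drop the helper columns before writing
  .write.partitionBy("key")
  .parquet("/location")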

Using orderBy

You can also control the number of files without repartitioning by ordering your dataframe accordingly:

data.orderBy($"key").write.partitionBy("key").parquet("/location")

This will lead to a total of spark.sql.shuffle.partitions files across all partitions (200 by default). It's even beneficial to add a second ordering column after $"key", as Parquet will remember the ordering of the dataframe and write its statistics accordingly. For example, you can order by an ID:

data.orderBy($"key",$"id").write.partitionBy("key").parquet("/location")

This will not change the number of files, but it will improve the performance when you query your parquet file for a given key and id. See e.g. https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide and https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
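To illustrate what that buys you: a lookup like the following (the key and id values are placeholders) can prune the key=... directory via partitionBy and skip most Parquet row groups via the min/max statistics on id:

// Partition pruning handles the key predicate; row-group statistics on the
// sorted id column let Parquet skip row groups that cannot match.
spark.read.parquet("/location")
  .filter($"key" === "someKey" && $"id" === 42)
  .show()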

Spark 2.2+

From Spark 2.2 on, you can also play with the new option maxRecordsPerFile to limit the number of records per file. You will still get at least N files if you have N partitions, but you can split the file written by 1 partition (task) into smaller chunks:

df.write
.option("maxRecordsPerFile", 10000)
...

See e.g. http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ and spark write to disk with N files less than N partitions
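For example, combined with the partitioned writes from above, a hedged sketch could look like this (the limit and path are just placeholders):

// Each write task emits files with at most 10000 records; a partition
// with more rows is split into several files automatically (Spark 2.2+).
data.write
  .option("maxRecordsPerFile", 10000)
  .partitionBy("key")
  .parquet("/location")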

This works very well for me:

data.repartition(n, $"key").write.partitionBy("key").parquet("/location")

It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce, and (again, anecdotally, on my data set) faster than repartitioning on the output columns only.

If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write-outs), and once it's all settled, use Hadoop's FileUtil (or just the AWS CLI) to copy everything over:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
// ...
// Copies everything under `in` to `out`, resolving the right FileSystem
// for each URI (e.g. file:// vs. s3a://) from the Hadoop configuration.
def copy(in: String, out: String, sparkSession: SparkSession) = {
  FileUtil.copy(
    FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
    new Path(in),
    FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
    new Path(out),
    false, // deleteSource: keep the local copy
    sparkSession.sparkContext.hadoopConfiguration
  )
}
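A possible usage, with placeholder paths, once the job has finished writing locally:

// Copy the locally written output up to S3 in one pass.
copy("file:///tmp/spark-output", "s3a://my-bucket/output", sparkSession)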

Edit: As per discussion in comments:

You have a dataset with a partition column of YEAR, but each YEAR contains vastly different amounts of data. So, one year might have 1 GB of data, but another might have 100 GB.

Here's pseudocode for one way to handle this:

val partitionSize = 10000 // Number of rows you want per output file.
// Collect the distinct YEAR values to the driver (assumes YEAR is an integer column).
val yearValues = df.select("YEAR").distinct.collect.map(_.getInt(0))
yearValues.foreach { yearVal =>
  val subDf = df.filter($"YEAR" === yearVal)
  // Aim for roughly partitionSize rows per file, with at least one partition.
  val numPartitionsToUse = math.max(1, (subDf.count / partitionSize).toInt)
  subDf.repartition(numPartitionsToUse).write.parquet(s"$outputPath/year=$yearVal")
}

But I don't actually know whether this will work. It's possible that Spark will have an issue reading in a variable number of files per column partition.

Another way to do it would be to write your own custom partitioner, but I have no idea what's involved in that, so I can't supply any code.
