Spark dataframe write method writing many small files

轮回少年 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12

6 Answers
  • 2020-11-27 18:16

    Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916

    This works very well for me:

    data.repartition(n, "key").write.partitionBy("key").parquet("/location")
    

    It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.

    If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write-outs), and once it's all settled, use Hadoop's FileUtil (or just the AWS CLI) to copy everything over:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    // ...
      def copy(
              in : String,
              out : String,
              sparkSession: SparkSession
              ) = {
        FileUtil.copy(
          FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
          new Path(in),
          FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
          new Path(out),
          false,
          sparkSession.sparkContext.hadoopConfiguration
        )
      }
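
    A minimal usage sketch of the helper above, assuming the job first writes to a local staging directory and that the SparkSession is available as spark; the paths and bucket name are placeholders:

    // Stage the output on a local drive, then copy the finished directory to S3.
    df.write.partitionBy("date").parquet("file:///mnt/staging/output")
    copy("file:///mnt/staging/output", "s3a://my-bucket/output", spark)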
    
  • 2020-11-27 18:21

    In Python, you can rewrite Raphael Roth's answer as:

    (df
      .repartition("date")
      .write.mode("append")
      .partitionBy("date")
      .parquet("{path}".format(path=path)))
    

    You might also consider adding more columns to .repartition to avoid problems with very large partitions:

    (df
      .repartition("date", another_column, yet_another_column)
      .write.mode("append")
      .partitionBy("date")
      .parquet("{path}".format(path=path)))
    
  • 2020-11-27 18:22

    You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.

    Try this:

    df
    .repartition($"date")
    .write.mode(SaveMode.Append)
    .partitionBy("date")
    .parquet(s"$path")
    
  • 2020-11-27 18:24

    I came across the same issue, and using coalesce solved my problem:

    df
      .coalesce(3) // number of parts/files 
      .write.mode(SaveMode.Append)
      .parquet(s"$path")
    

    For more information on using coalesce or repartition, you can refer to the following question: Spark: coalesce or repartition
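
    As a rough sketch of the distinction (not taken from the linked question): coalesce only merges existing partitions without a full shuffle, so it can only lower the partition count, while repartition always shuffles and can raise or lower it:

    // coalesce: merges existing partitions, no full shuffle, can only reduce the count
    val merged = df.coalesce(3)
    // repartition: full shuffle, can increase or decrease the count and evens out sizes
    val balanced = df.repartition(200)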

  • 2020-11-27 18:30

    The simplest solution would be to replace your actual partitioning by:

    df
     .repartition(to_date($"date"))
     .write.mode(SaveMode.Append)
     .partitionBy("date")
     .parquet(s"$path")
    

    You can also use more precise partitioning for your DataFrame, i.e. the day and maybe even the hour, and then be less precise for the writer (see the sketch below). How fine to go really depends on the amount of data.

    You can reduce entropy by repartitioning the DataFrame and then writing with the partitionBy clause.
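
    A minimal sketch of that idea, assuming a hypothetical hour column in the data: shuffle on the finer date/hour grain so no single task holds an entire day, while still writing one directory per date:

    df
     .repartition($"date", $"hour")   // "hour" is an assumed column; finer shuffle grain
     .write.mode(SaveMode.Append)
     .partitionBy("date")             // still one output directory per date
     .parquet(s"$path")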

  • 2020-11-27 18:31

    How about running a script like this as a MapReduce job to consolidate all the parquet files into one:

    $ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
     -Dmapred.reduce.tasks=1 \
     -input "/hdfs/input/dir" \
     -output "/hdfs/output/dir" \
     -mapper cat \
     -reducer cat
    