Limiting maximum size of dataframe partition

前端未结

When I write out a dataframe to, say, csv, a .csv file is created for each partition. Suppose I want to limit the max size of each file to, say, 1 MB. I could do the write mul

2 answers

灰色年华

2021-02-13 23:28
1. Single dataframe solution

I was trying to come up with an approach that would not kill the cluster, and the only thing that came to mind was:
1. Calculate the size of one serialized row
2. Get the number of rows in your DataFrame
3. Repartition into (estimated total size / expected partition size) partitions
4. Should work?
The code would look something like this:
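```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.sql.DataFrame

// helper function from https://stackoverflow.com/a/39371571/1549135:
// size in bytes of a value after Java serialization
def getBytes(value: Any): Long = {
  val stream = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(stream)
  oos.writeObject(value)
  oos.close()
  stream.toByteArray.length
}

val df: DataFrame = ??? // your df
val rowSize = getBytes(df.head) // serialized size of the first row only
val rowCount = df.count()
val partitionSize = 1000000L // target bytes per partition (1 MB = 10^6 bytes)
// round up, and never ask for 0 partitions
val noPartitions: Int =
  math.max(1L, (rowSize * rowCount + partitionSize - 1) / partitionSize).toInt
df.repartition(noPartitions).write.format("csv").save(???) // save to csv, your output path
```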
While my first thought was to calculate each row's byte size, that would be terribly inefficient. So, unless row sizes in your data differ greatly, I would say this solution will work. You can also measure every n-th row and average the result. You get the idea. Keep in mind that the Java-serialized size of a row is only a rough proxy for its size as CSV text.
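The sampling idea can be sketched as plain arithmetic, independent of Spark. The object and function names below (`SampledPartitionEstimate`, `averageRowBytes`, `partitionsNeeded`) are hypothetical, not part of the original answer. The sketch assumes you have already measured a handful of rows, for example with the `getBytes` helper above.

```scala
object SampledPartitionEstimate {

  /** Mean of the sampled per-row sizes in bytes (0.0 for an empty sample). */
  def averageRowBytes(sampleSizes: Seq[Long]): Double =
    if (sampleSizes.isEmpty) 0.0
    else sampleSizes.sum.toDouble / sampleSizes.size

  /** Number of partitions so that each holds roughly `targetBytes`.
    * Rounds up and never returns fewer than 1.
    */
  def partitionsNeeded(sampleSizes: Seq[Long], rowCount: Long, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val estimatedTotalBytes = averageRowBytes(sampleSizes) * rowCount
    math.max(1, math.ceil(estimatedTotalBytes / targetBytes).toInt)
  }
}

// Hypothetical usage with Spark, assuming `df` and `getBytes` from the answer:
//   val sizes = df.limit(100).collect().map(getBytes).toSeq
//   val n = SampledPartitionEstimate.partitionsNeeded(sizes, df.count(), 1000000L)
//   df.repartition(n)
```

Averaging over a sample smooths out a single unusually small or large first row, at the cost of collecting a few rows to the driver.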

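Also, `rowSize * rowCount` is computed as a Long, which holds values up to about 9.2 × 10^18, so overflow is unlikely for realistic data sizes. Dividing first (`rowSize / partitionSize * rowCount`) is not a safe alternative: with integer division it yields 0 whenever a single row is smaller than the partition size. If you are worried about overflow (if you have a lot of rows), it would be better to do the arithmetic in BigInt, f.e.:
```
val noPartitions: Int = (BigInt(rowSize) * rowCount / partitionSize).toInt
```
Again, this is just a draft idea with no domain knowledge about your data.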

2. Cross system solution

While going through the Apache Spark docs I found an interesting cross-system solution:

spark.sql.files.maxPartitionBytes, which sets:

The maximum number of bytes to pack into a single partition when reading files.

The default value is 134217728 (128 MB).

So I suppose you could set it to 1000000 (1 MB), and it will apply to every DataFrame read from files in that session. Note that it only governs how input files are split at read time; it does not by itself cap partition sizes after later transformations such as shuffles. Also, too small a partition size may greatly hurt your performance!

You can set it during SparkSession creation:
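```
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  // 1 MB per read partition, matching the value discussed above
  .config("spark.sql.files.maxPartitionBytes", 1000000L)
  .getOrCreate()
```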
All of the above is only valid if (I remember correctly and) the number of CSV files written equals the number of partitions of the DataFrame.
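As far as I know, the same setting can also be changed on an already running session through `spark.conf.set`, which affects file reads issued afterwards. The following is a minimal sketch under that assumption; the object name, app name, and local master are placeholders, not something from the original answer.

```scala
import org.apache.spark.sql.SparkSession

object MaxPartitionBytesAtRuntime {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("maxPartitionBytes sketch") // placeholder name
      .master("local[*]")                  // assumption: local test run
      .getOrCreate()

    // Change the read-time split size for subsequent file reads in this session.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 1000000L)

    // Read the value back to confirm what the session now holds.
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    spark.stop()
  }
}
```

To see the effect, compare `df.rdd.getNumPartitions` for the same input files before and after changing the value.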
刺人心

2021-02-13 23:42
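```
val df = spark.range(10000000)
df.cache()
// logical plan of the DataFrame
val catalyst_plan = df.queryExecution.logical
// size estimate in bytes taken from the optimized plan's statistics
val df_size_in_bytes = spark.sessionState.executePlan(catalyst_plan).optimizedPlan.stats.sizeInBytes
```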
df_size_in_bytes: BigInt = 80000000

The best approach would be to take 100 records, estimate their size, and extrapolate to all rows, as in the example above.
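Once you have a size estimate such as `df_size_in_bytes`, turning it into a partition count is a ceiling division. The helper below is a hypothetical sketch (the names `StatsBasedPartitions` and `partitionsFor` are mine). Bear in mind that the estimate describes Spark's internal representation, so the resulting CSV files may come out larger or smaller than the target.

```scala
object StatsBasedPartitions {

  /** Partitions needed so each holds at most about `targetBytes`.
    * Rounds up, returns at least 1, and clamps to Int.MaxValue.
    */
  def partitionsFor(sizeInBytes: BigInt, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val target = BigInt(targetBytes)
    val rounded = (sizeInBytes + target - 1) / target // ceiling division
    rounded.max(BigInt(1)).min(BigInt(Int.MaxValue)).toInt
  }
}

// Hypothetical usage, assuming `df` and `df_size_in_bytes` from the answer:
//   val n = StatsBasedPartitions.partitionsFor(df_size_in_bytes, 1000000L)
//   df.repartition(n)
```

For the 80000000-byte estimate shown above and a 1000000-byte target, this gives 80 partitions.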