How can I control the number of output files written from a Spark DataFrame?

猫巷女王i 2021-01-21 05:26

I am using Spark Streaming to read JSON data from a Kafka topic. I process the data with a DataFrame and later want to save the output to HDFS files. The problem is controlling how many output files get written when saving.
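
For context, here is a minimal sketch of the kind of pipeline the question describes, using Structured Streaming's Kafka source (the broker address, topic name, JSON schema, and HDFS paths are all assumptions, not taken from the question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("KafkaToHdfs").getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the incoming JSON records
    val schema = new StructType().add("id", StringType).add("value", DoubleType)

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")

    // Each micro-batch writes one file per partition of the DataFrame,
    // which is why the number of output files is hard to control directly.
    val query = df.writeStream
      .format("json")
      .option("path", "hdfs:///path/to/output")             // assumed path
      .option("checkpointLocation", "hdfs:///path/to/ckpt") // assumed path
      .start()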

2 answers
  •  不思量自难忘°
    2021-01-21 05:47

    You can use Spark's SizeEstimator to get a rough size for the DataFrame:

    import org.apache.spark.util.SizeEstimator

    // Estimate the in-memory footprint of the DataFrame object
    // (a heuristic, not the exact on-disk size of the data)
    val size = SizeEstimator.estimate(df)


    Next you can adapt the number of output files to the size of the DataFrame with repartition or coalesce, as sketched below.
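
    As a minimal sketch of that idea for a batch-style write (the 128 MB per-file target and the output path are assumptions, and SizeEstimator measures the in-memory footprint rather than the on-disk size, so treat the resulting count as a rough heuristic):

    import org.apache.spark.util.SizeEstimator

    // Aim for roughly 128 MB per output file (an assumed target)
    val targetFileSizeBytes = 128L * 1024 * 1024

    val estimatedBytes = SizeEstimator.estimate(df)
    val numFiles = math.max(1, (estimatedBytes / targetFileSizeBytes).toInt)

    // coalesce reduces partitions without a full shuffle; use repartition
    // instead if you need the data spread evenly across the files
    df.coalesce(numFiles)
      .write
      .mode("overwrite")
      .json("hdfs:///path/to/output") // assumed output path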
