How can I control the number of output files written from a Spark DataFrame?

猫巷女王i 2021-01-21 05:26

I am using Spark Streaming to read JSON data from a Kafka topic. I process the data with a DataFrame and later want to save the output to HDFS files. The problem is controlling how many output files get written when saving.
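
For context, here is a minimal sketch of the kind of pipeline the question describes, using Structured Streaming's Kafka source (the broker address, topic name, JSON schema, and HDFS paths are all assumptions, not taken from the question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("KafkaToHdfs").getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the incoming JSON records
    val schema = new StructType().add("id", StringType).add("value", DoubleType)

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")

    // Each micro-batch writes one file per partition of the DataFrame,
    // which is why the number of output files is hard to control directly.
    val query = df.writeStream
      .format("json")
      .option("path", "hdfs:///path/to/output")             // assumed path
      .option("checkpointLocation", "hdfs:///path/to/ckpt") // assumed path
      .start()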

2 answers
  •  不思量自难忘°
    2021-01-21 05:47

    You can use Spark's SizeEstimator to get a rough size for the DataFrame:

    import org.apache.spark.util.SizeEstimator

    // Estimate the in-memory footprint of the DataFrame object
    // (a heuristic, not the exact on-disk size of the data)
    val size = SizeEstimator.estimate(df)


    Next you can adapt the number of output files to the size of the DataFrame with repartition or coalesce, as sketched below.
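
    As a minimal sketch of that idea for a batch-style write (the 128 MB per-file target and the output path are assumptions, and SizeEstimator measures the in-memory footprint rather than the on-disk size, so treat the resulting count as a rough heuristic):

    import org.apache.spark.util.SizeEstimator

    // Aim for roughly 128 MB per output file (an assumed target)
    val targetFileSizeBytes = 128L * 1024 * 1024

    val estimatedBytes = SizeEstimator.estimate(df)
    val numFiles = math.max(1, (estimatedBytes / targetFileSizeBytes).toInt)

    // coalesce reduces partitions without a full shuffle; use repartition
    // instead if you need the data spread evenly across the files
    df.coalesce(numFiles)
      .write
      .mode("overwrite")
      .json("hdfs:///path/to/output") // assumed output path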
