How can I control the number of output files written from Spark DataFrame?


I am using Spark Streaming to read JSON data from a Kafka topic. I process the data with the DataFrame API and later save the output to HDFS files. The problem is that the write produces as many output files as the Dataset has partitions. How can I control how many files are written?
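For context, a minimal sketch of the kind of pipeline described, assuming Spark Structured Streaming; the broker address, topic name, schema, and paths are illustrative assumptions, not from the original question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    // Requires the spark-sql-kafka-0-10 connector on the classpath.
    val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

    // Hypothetical schema for the JSON payload.
    val schema = new StructType()
      .add("id", StringType)
      .add("payload", StringType)

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")

    // Each micro-batch writes one file per partition of the Dataset,
    // which is where the large number of output files comes from.
    df.writeStream
      .format("json")
      .option("path", "hdfs:///tmp/output")             // assumed output path
      .option("checkpointLocation", "hdfs:///tmp/ckpt") // required by streaming sinks
      .start()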

2 Answers
  • 2021-01-21 05:41

    The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

    • For Datasets with no wide dependencies, you can control the input partitioning with reader-specific parameters.
    • For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
    • Independently of the lineage, you can coalesce or repartition (see the sketch after this list).
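
    A minimal sketch of the last two options, assuming a SparkSession named spark and a DataFrame named df; the paths and partition counts are illustrative:

    // Shuffle partitions determine the partition count after wide
    // transformations such as groupBy or join, and hence the file count.
    spark.conf.set("spark.sql.shuffle.partitions", "10")

    // Independently of lineage: coalesce narrows to fewer partitions
    // without a full shuffle; repartition can increase or decrease the
    // count, at the cost of a shuffle.
    df.coalesce(5).write.json("hdfs:///tmp/out-coalesced")         // at most 5 files
    df.repartition(20).write.json("hdfs:///tmp/out-repartitioned") // exactly 20 files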

    As to the follow-up question, "is there a way to also limit the size of each file, so that a new file is written when the current one reaches a certain size / number of rows?":

    No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.

  • 2021-01-21 05:47

    You can use Spark's SizeEstimator:

    import org.apache.spark.util.SizeEstimator

    // Rough estimate of the in-memory size, in bytes.
    val size = SizeEstimator.estimate(df)

    Next, you can adapt the number of files to the size of the DataFrame with repartition or coalesce, as sketched below.
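
    A self-contained sketch of that idea. Note that SizeEstimator measures the in-memory size of the object graph it is given, so treat the result as a rough heuristic only; the 128 MB target file size is an assumption, chosen to match a common HDFS block size:

    import org.apache.spark.util.SizeEstimator

    val estimatedBytes = SizeEstimator.estimate(df)

    // Aim for roughly one HDFS-block-sized file each (128 MB, assumed).
    val targetBytesPerFile = 128L * 1024 * 1024
    val numFiles = math.max(1, (estimatedBytes / targetBytesPerFile).toInt)

    // repartition shuffles to exactly numFiles partitions; coalesce can
    // only reduce the count, but avoids a full shuffle.
    df.repartition(numFiles).write.json("hdfs:///tmp/output") // illustrative path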
