I am using Spark Streaming to read JSON data from a Kafka topic. I use a DataFrame to process the data, and later I want to save the output to HDFS files. The problem is controlling how many output files get written and how large each one is.
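For context, the pipeline looks roughly like this (a minimal sketch only, assuming Structured Streaming with the built-in kafka source; the broker, topic, schema and paths are placeholders, not the actual job):

// Sketch: read JSON from Kafka, parse it, write JSON files to HDFS.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-json-to-hdfs").getOrCreate()
import spark.implicits._

// Placeholder schema for the JSON payload.
val schema = new StructType().add("id", LongType).add("value", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"))
  .select("data.*")

// Each micro-batch is written out as one file per partition.
events.writeStream
  .format("json")
  .option("path", "hdfs:///output/events")                     // placeholder path
  .option("checkpointLocation", "hdfs:///checkpoints/events")  // placeholder path
  .start()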
The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

- For Datasets with no wide dependencies, you can control the input partitioning using reader-specific parameters.
- For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
- Independently of the lineage, you can coalesce or repartition before writing, as in the sketch after this list.
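For example (a minimal batch-style sketch; the paths and partition counts are arbitrary):

// Sketch: three ways to influence the number of output files (one file per partition).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// 1. Wide (shuffle) plans: set the number of shuffle partitions up front.
spark.conf.set("spark.sql.shuffle.partitions", "50")

val df = spark.read.json("hdfs:///input/events")   // placeholder input path

// 2. coalesce: reduce partitions without a full shuffle (cheap, files may be uneven).
df.coalesce(10).write.json("hdfs:///output/coalesced")

// 3. repartition: full shuffle to exactly N roughly even partitions / files.
df.repartition(10).write.json("hdfs:///output/repartitioned")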
Is there a way to also limit the size of each file, so a new file is written when the current one reaches a certain size or number of rows?
No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
You can use the size estimator:

import org.apache.spark.util.SizeEstimator
val size = SizeEstimator.estimate(df)

Then you can adapt the number of output files to the size of the DataFrame with repartition or coalesce.
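One way to put that together (a sketch; the 128 MB target per file is an arbitrary assumption, the path is a placeholder, and note that SizeEstimator reports an estimated in-memory size, which usually differs from the serialized on-disk size):

// Sketch: derive a file count from the estimated size of the DataFrame `df` above.
import org.apache.spark.util.SizeEstimator

val targetBytesPerFile = 128L * 1024 * 1024                        // assumed target size
val estimatedBytes = SizeEstimator.estimate(df)                    // in-memory estimate
val numFiles = math.max(1, (estimatedBytes / targetBytesPerFile).toInt)

df.repartition(numFiles).write.json("hdfs:///output/sized")        // placeholder path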