I am using Spark Streaming to read JSON data from a Kafka topic. I process the data with a DataFrame, and afterwards I want to save the output to HDFS files. The problem is that using:
You can use SizeEstimator:
import org.apache.spark.util.SizeEstimator
val size = SizeEstimator.estimate(df)
Next, you can adapt the number of files to the size of the DataFrame with repartition or coalesce.
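To make that last step concrete, here is a rough sketch of how the estimate could be turned into a partition count. The names `OutputSizing`, `numOutputFiles`, `writeSized`, and the 128 MB target per file are my own assumptions for illustration, as is writing JSON in append mode; adjust them to your job. Keep in mind that SizeEstimator measures the in-memory JVM object it is given, so the number is probably only a loose guide to the size of the files on HDFS.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

object OutputSizing {
  // Assumed target size per output file: 128 MB (hypothetical, tune for your cluster).
  val TargetFileBytes: Long = 128L * 1024 * 1024

  // Files needed so that each holds at most targetBytes; never fewer than 1.
  def numOutputFiles(estimatedBytes: Long, targetBytes: Long = TargetFileBytes): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val files = (estimatedBytes + targetBytes - 1) / targetBytes // ceiling division
    math.max(1L, math.min(files, Int.MaxValue.toLong)).toInt
  }

  // Resize the DataFrame to the computed partition count, then write it out.
  def writeSized(df: DataFrame, path: String): Unit = {
    val estimatedBytes = SizeEstimator.estimate(df) // rough in-memory estimate only
    val target = numOutputFiles(estimatedBytes)
    val current = df.rdd.getNumPartitions
    // coalesce avoids a shuffle when shrinking; repartition is needed to grow.
    val resized = if (target < current) df.coalesce(target) else df.repartition(target)
    resized.write.mode("append").json(path) // output format is an assumption
  }
}
```

In a streaming job you would call something like `OutputSizing.writeSized(df, "hdfs:///some/output/dir")` once per batch, where the path is a placeholder. Each partition becomes one output file, so the partition count is what controls the number of files.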