Question
I saw several Q&As about writing a single file into HDFS; it seems using coalesce(1)
is sufficient.
E.g.:
df.coalesce(1).write.mode("overwrite").format(format).save(location)  // 1 partition -> 1 output file
But how can I specify the exact number of files that will be written after the save operation?
So my questions are:
If I have a DataFrame consisting of 100 partitions, will a write operation produce 100 files?
If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50)
before the write operation, will it produce 50 files?
Is there a way in Spark to specify the resulting number of files when writing a DataFrame into HDFS?
Thanks
Answer 1:
The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part, and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy
is used.
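To make this concrete, here is a minimal Scala sketch; the SparkSession setup, app name, and HDFS output path are illustrative assumptions, not from the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ExactFileCountSketch").getOrCreate()

// Build a toy DataFrame; spark.range creates a single-column Dataset of Longs.
val df = spark.range(0, 1000000).toDF("value")

// repartition(50) shuffles the data into 50 partitions, so the write runs
// 50 tasks and produces 50 part files under the target directory.
// coalesce(50) would also yield 50 files without a full shuffle, but it
// can only reduce (never increase) the partition count.
df.repartition(50)
  .write
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///tmp/exact_file_count")  // hypothetical output path

Listing the output directory afterwards should show 50 part-* files (plus a _SUCCESS marker from the standard output committer).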
Normally:
If I have a DataFrame consisting of 100 partitions, will a write operation produce 100 files?
Yes
If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50) before the write operation, will it produce 50 files?
And yes.
Is there a way in Spark to specify the resulting number of files when writing a DataFrame into HDFS?
No.
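To illustrate the "can be larger" case mentioned above: with partitionBy, each write task emits one file per distinct partition value it holds, so the total file count can exceed the number of tasks. A minimal sketch under the same assumptions (toy data and a hypothetical path):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionBySketch").getOrCreate()
import spark.implicits._

// Toy data with a column to partition the output by.
val sales = Seq((1, "US"), (2, "DE"), (3, "US"), (4, "FR")).toDF("id", "country")

// With 2 write tasks and 3 distinct country values, the output can contain
// up to 2 * 3 = 6 part files, spread across country=US/DE/FR subdirectories.
sales.repartition(2)
  .write
  .partitionBy("country")
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///tmp/sales_by_country")  // hypothetical output path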
Source: https://stackoverflow.com/questions/51098198/spark-how-to-specify-number-of-resulting-files-for-dataframe-while-after-writing