Spark saveAsTextFile() writes to multiple files instead of one [duplicate]


1 Answer

梦如初夏

2021-02-10 05:28
The "problem" is actually a feature. Spark writes one output file per partition, so an RDD with n partitions is saved as n part files. To get a single file, reduce the number of partitions to one by calling repartition on your RDD. The documentation states:

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
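To see how the partition count changes, here is a minimal sketch. The object name, the local master setting, and the sample data are assumptions made for illustration; they are not from the question.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCounts {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .appName("partition-counts")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample data, explicitly split into 8 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)

    println(rdd.getNumPartitions)                // 8
    println(rdd.repartition(2).getNumPartitions) // 2, always via a shuffle
    println(rdd.coalesce(2).getNumPartitions)    // 2, merges partitions without a shuffle

    spark.stop()
  }
}
```

Saving any of these RDDs with saveAsTextFile would produce as many part files as the count printed for it.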

For example, this change should work.
```
myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
```
As the documentation says, you can also use coalesce, which is the recommended option when you are reducing the number of partitions. However, reducing the number of partitions to one is generally considered a bad idea, because it moves all of the data to a single node and gives up parallelism.
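A coalesce-based version might look like the sketch below. The object name, the local master setting, and the sample pairs are assumptions for illustration; the output path is the placeholder from the example above.

```scala
import org.apache.spark.sql.SparkSession

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .appName("single-file-output")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical stand-in for myRDD: key/value pairs in 4 partitions.
    val myRDD = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 4)

    myRDD
      .map { case (k, v) => s"$k,$v" } // same "key,value" line format as above
      .coalesce(1)                     // merge into one partition without a full shuffle
      .saveAsTextFile("/path/to/output") // fails if the directory already exists

    spark.stop()
  }
}
```

Note that the output path is still a directory. With one partition it should contain a single part-00000 file next to the _SUCCESS marker.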