The "problem" is indeed a feature, and it is produced by how your RDD
is partitioned, hence it is separated in n
parts where n
is the number of partitions. To fix this you just need to change the number of partitions to one, by using repartition on your RDD
. The documentation states:
repartition(numPartitions)
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are
decreasing the number of partitions in this RDD, consider using
coalesce, which can avoid performing a shuffle.
For example, this change should work.
myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
As the documentation says you can also use coalesce, which is actually the recommended option when you are reducing the number of partitions. However, reducing the number of partitions to one is considered a bad idea, because it causes shuffling of the data to one node and loss of parallelism.