By default, the Hadoop MapReduce shuffle sorts the shuffle key within each partition, but not across partitions (it is the total ordering that makes keys sorted across t
I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.
Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy or sortWith or sorted.
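The mapPartitions approach described above might look like the following. This is a minimal sketch, assuming a local SparkSession; the object name, the helper sortPartition and the sample data are made up for illustration. Note that it.toSeq followed by a sort materializes the whole partition in memory, so this only suits partitions that fit on a single executor.

```scala
import org.apache.spark.sql.SparkSession

object SortInsidePartitions {
  // Hypothetical helper: sorts the elements of one partition.
  // The whole partition is materialized in memory before sorting.
  def sortPartition[T: Ordering](it: Iterator[T]): Iterator[T] =
    it.toSeq.sorted.iterator

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-inside-partitions")
      .getOrCreate()

    // Sample data split into two partitions.
    val rdd = spark.sparkContext
      .parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // mapPartitions returns a new RDD. foreachPartition returns Unit, so it
    // is only useful when the sorted data is consumed by a side effect.
    val sorted = rdd.mapPartitions(it => sortPartition(it), preservesPartitioning = true)

    // glom() turns each partition into an array so the result can be inspected.
    sorted.glom().collect().foreach(p => println(p.mkString(", ")))

    spark.stop()
  }
}
```

Keeping the sort in a plain function on Iterator makes it easy to check without a cluster, and sortBy or sortWith can be swapped in for sorted if a custom order is needed.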
You can use Dataset and its sortWithinPartitions method:
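// Assumes a spark-shell session, where spark and sc are already defined.
import spark.implicits._  // needed for toDF and the $"..." column syntax

// Two partitions: ("e", "d", "f") and ("b", "c", "a")
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")  // sorts each partition independently
  .show()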
+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+
In general, the shuffle is an important factor when sorting partitions, because the sort can reuse the shuffle structures and so does not need to load all the data into memory at once.
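For RDDs of key-value pairs, one way to let the shuffle do the sorting is repartitionAndSortWithinPartitions, which the Spark documentation describes as pushing the sort down into the shuffle machinery. The following is only a sketch under assumptions: a local SparkSession, a HashPartitioner with two partitions, placeholder values of 1, and an object and method name (ShuffleSortSketch, keysByPartition) invented for this example.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object ShuffleSortSketch {
  // Hypothetical helper: returns the keys of each partition after the
  // shuffle, so the per-partition order can be inspected.
  def keysByPartition(spark: SparkSession): Array[Seq[String]] = {
    // Key-value pairs; the value 1 is just a placeholder.
    val pairs = spark.sparkContext
      .parallelize(Seq("e", "d", "f", "b", "c", "a"))
      .map(k => (k, 1))

    // Repartitions by key and sorts by key inside each resulting partition
    // as part of the shuffle. Keys are NOT ordered across partitions.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    sorted.keys.glom().collect().map(_.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-sort-sketch")
      .getOrCreate()

    keysByPartition(spark).foreach(p => println(p.mkString(", ")))

    spark.stop()
  }
}
```

Which keys land in which partition depends on the partitioner, so only the order inside each partition is guaranteed.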