I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable
The Spark Programming Guide offers three alternatives if one desires predictably ordered data following shuffle:
mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD
As written in the Spark API, repartitionAndSortWithinPartitions is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
The sorting, however, is computed by looking only at the keys K of the tuples (K, V). The trick is to put all the relevant information in the first element of the tuple, like ((K, V), null), and to define a custom partitioner and a custom ordering. This article describes the technique pretty well.
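To make the composite-key trick concrete, here is a minimal sketch in plain Python that imitates what the shuffle does, without needing a Spark installation. It is not Spark code: the sample records, the assumption that K is an (id, time) pair, the final grouping by ID, and all function names (to_composite, partition_of, ordering, repartition_and_sort, values_by_id) are hypothetical and chosen only for illustration.

```python
import zlib
from itertools import groupby

# Hypothetical sample data: (K, V) pairs where K = (id, time).
records = [
    (("a", 3), "a3"),
    (("b", 1), "b1"),
    (("a", 1), "a1"),
    (("b", 2), "b2"),
    (("a", 2), "a2"),
]

def to_composite(pair):
    """Move everything relevant into the key: (K, V) becomes ((K, V), None)."""
    key, value = pair
    return ((key, value), None)

def partition_of(composite_key, num_partitions):
    """Custom partitioner: route on the ID only, so one ID never spans partitions."""
    (record_id, _time), _value = composite_key
    # crc32 is used instead of hash() so the routing is stable across runs.
    return zlib.crc32(record_id.encode("utf-8")) % num_partitions

def ordering(composite_key):
    """Custom ordering: sort by ID first, then by time."""
    (record_id, time), _value = composite_key
    return (record_id, time)

def repartition_and_sort(pairs, num_partitions):
    """Local stand-in for the shuffle: bucket by partitioner, sort each bucket."""
    partitions = [[] for _ in range(num_partitions)]
    for composite_key, nothing in map(to_composite, pairs):
        index = partition_of(composite_key, num_partitions)
        partitions[index].append((composite_key, nothing))
    for part in partitions:
        part.sort(key=lambda kv: ordering(kv[0]))
    return partitions

def values_by_id(partition):
    """One pass over a sorted partition, yielding (id, values ordered by time)."""
    for record_id, group in groupby(partition, key=lambda kv: kv[0][0][0]):
        yield record_id, [composite_key[1] for composite_key, _ in group]

partitions = repartition_and_sort(records, num_partitions=2)
result = {
    record_id: values
    for part in partitions
    for record_id, values in values_by_id(part)
}
```

The partitioner looks only at the ID while the ordering looks at both the ID and the time, which is why a single pass over each sorted partition is enough to collect the values of one ID in time order. In Spark itself the same two roles are played by the custom partitioner and the custom ordering passed to repartitionAndSortWithinPartitions; in PySpark they correspond, as far as I know, to its partitionFunc and keyfunc arguments.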