I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable[V]).
The answer from Matei, who I consider authoritative on this topic, is quite clear:
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.
In that context, a better option would be to apply the sorting to the resulting collections per key:
rdd.groupByKey().mapValues(_.toSeq.sorted)
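For the case in the question, where the key carries both an ID and a time, the same idea can be sketched end to end. This is only an illustration under assumptions: the record layout `((id, time), value)`, the sample data, and the object name `SortedGroups` are all made up here, and it runs against a local master. It regroups by ID and then sorts each group by time inside `mapValues`, so the result does not depend on the order in which the shuffle delivers records.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortedGroups {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setAppName("sorted-groups").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical records: ((id, time), value), deliberately out of order.
      val rdd = sc.parallelize(Seq(
        (("a", 3L), "a3"),
        (("b", 1L), "b1"),
        (("a", 1L), "a1"),
        (("a", 2L), "a2"),
        (("b", 2L), "b2")
      ))

      val grouped = rdd
        // Move the time out of the key so that grouping happens on the ID only.
        .map { case ((id, time), value) => (id, (time, value)) }
        .groupByKey()
        // groupByKey yields an Iterable, which has no sort method, so convert
        // to a Seq first, sort by time, then keep only the values.
        .mapValues(_.toSeq.sortBy(_._1).map(_._2))

      grouped.collect().foreach { case (id, values) =>
        println(s"$id -> ${values.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Note that the sort happens in memory for each key, so this approach assumes that the values for any single key fit comfortably on one executor.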