How to sort within partitions (and avoid sorting across partitions) using the RDD API?

It is the default behavior of the Hadoop MapReduce shuffle to sort the shuffle key within each partition, but not across partitions (it is the total ordering that makes keys sorted cross t

2 Answers

无人及你

2021-01-02 01:33

I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.

Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy or sortWith or sorted.
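As a minimal sketch of that idea (not something I have benchmarked), the following uses mapPartitions with it.toSeq and sorted. It is written in spark-shell / Scala script style; the local master setting, the app name, and the variable names are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sort-within-partitions-sketch")
  .getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

// Sort every partition independently. No shuffle is triggered, so elements
// never move between partitions and no ordering across partitions is imposed.
// Caveat: toSeq followed by sorted materializes the whole partition in memory.
val sortedWithin = rdd.mapPartitions(
  it => it.toSeq.sorted.iterator,
  preservesPartitioning = true
)

// glom() turns each partition into an array, which makes the
// per-partition order visible after collect().
val partitions = sortedWithin.glom().collect().map(_.toList).toList
partitions.foreach(println)
```

The main trade-off is memory: each partition has to fit in memory on its executor while it is being sorted.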

你的背包

2021-01-02 01:34
You can use a Dataset and the sortWithinPartitions method:
```
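// Needed for .toDF and the $"..." column syntax (spark is the active SparkSession)
import spark.implicits._

// Two partitions: ("e", "d", "f") and ("b", "c", "a")
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text") // sorts each partition locally, no global ordering
  .show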

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+
```
In general, the shuffle is an important factor in sorting partitions, because the sort reuses shuffle structures and so does not need to load all the data into memory at once.
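If you want to stay on the RDD API, a shuffle-based option that I believe fits here is repartitionAndSortWithinPartitions on a key-value RDD. This is a sketch rather than a recommendation: the HashPartitioner with 2 partitions, the dummy value 1, and the local session settings are all assumptions chosen for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("shuffle-sort-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// The operation is defined on RDDs of (key, value) pairs,
// so each string is paired with a dummy value here.
val pairs = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2).map(k => (k, 1))

// Shuffle the records by key hash into 2 partitions and sort by key inside
// each partition as part of the shuffle. Keys are ordered within a partition
// only; there is no total ordering across partitions.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// Inspect the keys partition by partition.
val keysPerPartition = sorted.keys.glom().collect().map(_.toList).toList
keysPerPartition.foreach(println)
```

Unlike a local per-partition sort, this one does move data between partitions, because the partitioner decides where each key lands. That is also what lets the sort be done by the shuffle machinery.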