How to sort within partitions (and avoid sort across the partitions) using RDD API?

再見小時候 2021-01-02 00:58

It is Hadoop MapReduce shuffle's default behavior to sort the shuffle key within each partition, but not across partitions (it is the total ordering that makes keys sorted cross t

2 Answers
  • 2021-01-02 01:33

    I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.

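    Since they give you a Scala Iterator, you could call it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy, sortWith or sorted. Note that sorting this way materializes the whole partition in memory, and that only mapPartitions returns a new RDD; foreachPartition returns nothing.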
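
A minimal sketch of the mapPartitions approach described above, reusing the sample data from the other answer. The local master setting and the app name are assumptions made only so the snippet is self-contained; in spark-shell the existing `sc` could be used instead.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are hypothetical.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sort-within-partitions-rdd")
  .getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

// Sort each partition independently; mapPartitions does not trigger a shuffle.
// toArray pulls the entire partition into memory before sorting.
val sortedPerPartition = rdd.mapPartitions(
  it => it.toArray.sorted.iterator,
  preservesPartitioning = true
)

// glom() turns each partition into an array so the boundaries stay visible.
val sortedPartitions = sortedPerPartition.glom().collect()
sortedPartitions.foreach(p => println(p.mkString(", ")))
```

Because the whole partition is held in memory during the sort, this is probably only suitable when individual partitions are comfortably smaller than executor memory.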

  • 2021-01-02 01:34

    You can use the Dataset API and its sortWithinPartitions method:

    import spark.implicits._
    
    sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
      .toDF("text")
      .sortWithinPartitions($"text")
      .show
    
    +----+
    |text|
    +----+
    |   d|
    |   e|
    |   f|
    |   a|
    |   b|
    |   c|
    +----+
    

    In general, the shuffle is an important factor when sorting partitions, because the sort can reuse the shuffle's data structures and so does not need to load all of the data into memory at once.
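
On the RDD side, the point about reusing the shuffle corresponds to repartitionAndSortWithinPartitions, which is available on key-value RDDs whose key type has an Ordering. The sketch below is one possible use, not necessarily what the asker needs: the HashPartitioner with 2 partitions, the dummy value 1, and the local session settings are all assumptions chosen for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are hypothetical.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("repartition-and-sort")
  .getOrCreate()
val sc = spark.sparkContext

// The method needs (key, value) pairs; the value 1 is just a placeholder.
val pairs = sc.parallelize(Seq("e", "d", "f", "b", "c", "a")).map(k => (k, 1))

// A single shuffle that partitions by key and sorts keys inside each partition.
// There is no ordering guarantee across partitions.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

val partitions = sorted.glom().collect()
partitions.foreach(p => println(p.map(_._1).mkString(", ")))
```

Which keys end up in which partition depends on the partitioner, so only the order inside each partition should be relied on.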
