How can I partition an RDD respecting order?


I would like to divide an RDD into a number of partitions corresponding to the number of distinct keys I found (3 in this case):

RDD: [(1,a), (1,b), (1,c), (2,d), (3,e), (3,f), (3,g), (3,h), (3,i)]

1 Answer

    Assigning your partitions like this:

    p1 [(1,a), (1,b), (1,c)]
    p2 [(2,d), (3,e), (3,f)]
    p3 [(3,g), (3,h), (3,i)]
    

    would mean assigning the same key to more than one partition (for key 3, either p2 or p3). A partitioner, like a mathematical function, cannot return several values for the same argument; otherwise, what would the result depend on?
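
    To make that concrete, here is a minimal sketch of a custom Partitioner (the class name ExactKeyPartitioner and the fixed key set are my own illustration, not part of your code) that gives each distinct key exactly one partition:

        import org.apache.spark.Partitioner

        // A Partitioner is literally a function key => partition id, so each
        // key can land in exactly one partition.
        class ExactKeyPartitioner(keys: Seq[Int]) extends Partitioner {
          private val index: Map[Int, Int] = keys.distinct.sorted.zipWithIndex.toMap
          override def numPartitions: Int = index.size
          override def getPartition(key: Any): Int = index(key.asInstanceOf[Int])
        }

        // rdd.partitionBy(new ExactKeyPartitioner(Seq(1, 2, 3)))
        // necessarily places every record with key 3 into a single partition,
        // which is exactly why the p2/p3 split above is impossible.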

    What you could do instead is add a component to your partition key, which would give you more buckets (effectively splitting one set into smaller sets). Keep in mind, though, that you have virtually no control over how Spark places partitions onto nodes, so data you wanted on the same node can still end up spread across several nodes.
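
    A hedged sketch of that "salting" idea (the modulus 3, the partition count 9, and the names salted and repartitioned are arbitrary illustrative choices):

        import org.apache.spark.HashPartitioner

        // Extend each key with a salt so one heavy key spreads over several
        // buckets: (3, "g") might become ((3, 1), "g").
        val salted = rdd.zipWithIndex().map { case ((k, v), i) =>
          ((k, (i % 3).toInt), v)
        }
        val repartitioned = salted.partitionBy(new HashPartitioner(9))
        // Strip the salt before aggregating: .map { case ((k, _), v) => (k, v) }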

    It really boils down to what job you would like to perform. I would recommend starting from the outcome you want and seeing whether you can come up with a smart partition key with a reasonable trade-off (if one is really necessary). Perhaps you could hold the values by letter and then use operations like reduceByKey rather than groupByKey to get your final results?
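
    For instance, assuming the goal is one combined value per key (concatenating the letters here is just an assumed stand-in for your real aggregation):

        // reduceByKey combines values map-side before the shuffle, so the
        // physical placement of partitions matters far less than with groupByKey.
        val perKey = rdd.reduceByKey(_ + _)
        // With one-letter string values: (1, "abc"), (2, "d"), (3, "efghi")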
