I would like to divide an RDD into a number of partitions corresponding to the number of different keys I found (3 in this case):
RDD: [(1,a), (1,b), (1,c), (2,d), (3,e), (3,f), (3,g), (3,h), (3,i)]
Assigning your partitions like this
p1 [(1,a), (1,b), (1,c)]
p2 [(2,d), (3,e), (3,f)]
p3 [(3,g), (3,h), (3,i)]
would mean assigning the same key to different partitions (key 3 goes to either p2 or p3, depending on the record). A partitioner is like a mathematical function: it cannot return different values for the same argument (what would the value depend on then?).
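For illustration, here is a minimal sketch of a custom partitioner (the name KeyPartitioner and the modulo scheme are my own assumptions, not from your question): getPartition is a pure function of the key, so every record with key 3 necessarily lands in a single partition.

import org.apache.spark.Partitioner

// One key -> exactly one partition; a partitioner cannot send
// key 3 to "p2 or p3" any more than f(3) can have two values.
class KeyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case k: Int => ((k % numPartitions) + numPartitions) % numPartitions
    case _      => 0
  }
}

// Usage: rdd.partitionBy(new KeyPartitioner(3))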
What you could do instead is add something to your partition key, which would result in more buckets (effectively splitting one set into smaller sets). But you have virtually no control over how Spark places your partitions onto the nodes, so data that you wanted on the same node can end up spanning multiple nodes.
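A rough sketch of that idea, often called salting (the salt range of 2 and the HashPartitioner are illustrative assumptions; rdd is the pair RDD from your question):

import org.apache.spark.HashPartitioner
import scala.util.Random

// Widen the key: (3, 0) and (3, 1) are distinct composite keys,
// so key 3 now has two buckets it can hash into.
val salted = rdd.map { case (k, v) => ((k, Random.nextInt(2)), v) }
val repartitioned = salted.partitionBy(new HashPartitioner(6))
// Note: which nodes those six partitions live on is still Spark's call.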
It really boils down to what job you want to perform. I would recommend considering the outcome you want to get and seeing whether you can come up with a smart partition key with a reasonable tradeoff (if it is really necessary). Maybe you could key the values by the letter and then use operations like reduceByKey rather than groupByKey to get your final results, as in the sketch below?
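A hedged sketch of that suggestion, assuming sc is an available SparkContext and that the desired result is, say, the concatenated letters per key:

val rdd = sc.parallelize(Seq(
  (1, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e"),
  (3, "f"), (3, "g"), (3, "h"), (3, "i")))

// reduceByKey merges values map-side before the shuffle...
val concatenated = rdd.reduceByKey(_ + _)
// e.g. (1,"abc"), (2,"d"), (3,"efghi"); within-key order is not guaranteed

// ...whereas the groupByKey equivalent ships every record across
// the network first:
// rdd.groupByKey().mapValues(_.mkString)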