How to sort by value efficiently in PySpark?

北恋 2021-01-04 20:30

I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered() is good for this if you know how many you need:

b = sc.paralleliz         


        
1 Answer
  • 2021-01-04 21:09

    I think sortBy() is more concise:

    b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
    bSorted = b.sortBy(lambda a: a[1])
    bSorted.collect()
    ...
    [('c', 1),('t', 3),('b', 4)]
    

    It's actually not more efficient at all, since it involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a better approach: any solution needs a way to transform your data so that the values become the keys, and then eventually transform that data back to the original schema.
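
    To make those three steps concrete, here is a minimal sketch on a plain Python list, so it runs without Spark. The RDD method names in the comments (keyBy, sortByKey, values) are my assumption about which call matches each step, and the variable names are only illustrative.

```python
# Local sketch of the three steps on a plain Python list (no Spark needed).
pairs = [('t', 3), ('b', 4), ('c', 1)]

# Step 1: key each record by its value, roughly rdd.keyBy(lambda a: a[1]).
keyed = [(record[1], record) for record in pairs]

# Step 2: sort by the temporary key, roughly rdd.sortByKey().
keyed_sorted = sorted(keyed, key=lambda kv: kv[0])

# Step 3: drop the temporary key again, roughly rdd.values().
result = [record for _, record in keyed_sorted]

print(result)  # [('c', 1), ('t', 3), ('b', 4)]
```

    The extra pass to attach and then strip the key is the overhead mentioned above; on an RDD the sort step is also where the shuffle would happen.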
