Does collect_list() maintain relative ordering of rows?

攒了一身酷 2020-12-15 21:15

Imagine that I have the following DataFrame df:

+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a         


        
1 Answer
  • 2020-12-15 21:38

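    I think you can usually rely on "their relative order", since Spark goes over the rows of a partition one by one in order and does not re-order rows unless it has to. It is not a guarantee, though: the Spark documentation describes collect_list as non-deterministic, because the order of the collected results depends on the order of the rows, which may change after a shuffle.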

    If you are concerned about the order, combine the two columns into a single column with the struct function before doing the groupBy, so that each feature name stays paired with its value.

    struct(colName: String, colNames: String*): Column
    Creates a new struct column that composes multiple input columns.
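
    As a rough sketch of the struct approach (Scala, assuming a local SparkSession; the sample rows and the features alias are made up for illustration, since the table in the question is cut off):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Local session for illustration only (assumption, not from the question).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-struct")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with the same column names as the question's df.
val df = Seq(
  ("id1", "a", 3.0),
  ("id1", "b", 4.0),
  ("id2", "a", 2.0),
  ("id2", "c", 5.0)
).toDF("id", "featureName", "featureValue")

// Pack name and value into one struct column, then collect the structs,
// instead of calling collect_list on each column separately.
val paired = df
  .groupBy("id")
  .agg(collect_list(struct(col("featureName"), col("featureValue"))).as("features"))

paired.show(false)
```

    Each element of features carries its own featureName and featureValue, so the pairing holds no matter in which order collect_list emits the elements.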

    You could also use the monotonically_increasing_id function to number the records, and pair that number with the other columns (perhaps using struct):

    monotonically_increasing_id(): Column
    A column expression that generates monotonically increasing 64-bit integers.

    The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
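
    One possible way to put this together, as a sketch only: number the rows, collect structs whose first field is that number, and sort each collected array. The sort_array call is my own addition rather than something mentioned above, and the sample rows, the rowId column and the features alias are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, monotonically_increasing_id, sort_array, struct}

// Local session for illustration only (assumption, not from the question).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-ordered")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; "b" deliberately comes before "a" for id1.
val df = Seq(
  ("id1", "b", 4.0),
  ("id1", "a", 3.0),
  ("id2", "c", 5.0),
  ("id2", "a", 2.0)
).toDF("id", "featureName", "featureValue")

// Number the rows before grouping; the IDs follow the current row order.
val numbered = df.withColumn("rowId", monotonically_increasing_id())

// rowId is the first struct field, so sort_array orders each group's
// entries by it (structs are compared field by field).
val ordered = numbered
  .groupBy("id")
  .agg(
    sort_array(
      collect_list(struct(col("rowId"), col("featureName"), col("featureValue")))
    ).as("features")
  )

ordered.show(false)
```

    The IDs only need to be increasing for the sort to restore the original order, so it does not matter that they are not consecutive.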
