Use more than one collect_list in one query in Spark SQL


I have the following dataframe data:

root
 |-- userId: string 
 |-- product: string 
 |-- rating: double

and the following que

1 Answer

终归单人心

2020-12-15 12:55
I believe there is no explicit guarantee that all arrays will have the same order. Spark SQL applies multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) produces a nondeterministic order, it is theoretically possible that the order will differ between arrays.

So while it should work in practice, it could be risky and introduce some hard-to-detect bugs.

If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:
```
SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
```
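If two separate arrays are still needed downstream, they can probably be recovered from the collected structs. This is a minimal sketch, assuming Spark 2.0.0 or later and the `data` table from the question; the aliases `pairs`, `products`, and `ratings` are made up for illustration:
```
-- Collect (product, rating) pairs once, then project each field back out.
-- Selecting a field from an array of structs yields an array of that field.
SELECT
  userId,
  pairs.product AS products,  -- array<string>
  pairs.rating  AS ratings    -- array<double>
FROM (
  SELECT userId, collect_list(struct(product, rating)) AS pairs
  FROM data
  GROUP BY userId
) grouped
```
Both arrays are projected from the same array of structs, so the element at a given position in `products` and the element at the same position in `ratings` come from the same input row.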
If you use an earlier version, you can try explicit partitioning and ordering:
```
WITH tmp AS (
  SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId
```