How to use ORDER BY with the collect_set() operation in Hive

滥情空心 2021-02-04 21:21

In Table 1, I have customer_id, item_id and item_rank (rank of item according to some sales). I want to collect a list of items for each customer_id and arrange them according

2 answers
  • 2021-02-04 21:35

    SELECT customer_id, collect_set(item_id) AS item_list FROM table1 GROUP BY customer_id ORDER BY item_rank

    Note: collect_list() keeps duplicates, whereas collect_set() returns only unique values.

  • 2021-02-04 21:47

    You can use a sub-query to get a result set of (customer_id, item_id, item_rank), sorted by item_rank, and then use collect_set in the outer query.

    Query

    WITH table1 AS (
        SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
        SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
        SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
        SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
        SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
    )
    SELECT
        subquery.customer_id,
        collect_set(subquery.item_id) AS item_id_set
    FROM (
        SELECT
            table1.customer_id,
            table1.item_id,
            table1.item_rank
        FROM table1
        DISTRIBUTE BY
            table1.customer_id
        SORT BY
            table1.customer_id,
            table1.item_rank
    ) subquery
    GROUP BY
        subquery.customer_id
    ;
    

    Results

        customer_id  item_id_set
    0   23           [4,2]
    1   25           [5,4]
    

    The sub-query uses DISTRIBUTE BY to guarantee that all rows for a particular customer_id route to the same reducer. It then uses SORT BY to sort by customer_id and item_rank within each reducer. I expect this is sufficient for the requirements, because I didn't notice a requirement for total ordering of the final result set. (If total ordering by customer_id is a requirement, then I think the query would have to use ORDER BY, which would cause slower execution.)

    Internally, the collect_set UDAF uses a Java LinkedHashSet, which is an order-preserving collection, so the same sort order used in the sub-query will be maintained in the outer query's set. This is visible in the Hive codebase here:

    https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93
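
    The ordering argument above can be illustrated outside Hive. The following is only a minimal Java sketch, not Hive's actual evaluator code: the class CollectSetOrderSketch, the Row type, and the collectSets method are hypothetical names introduced for illustration. It sorts the sample rows by customer_id and item_rank (standing in for the sub-query's SORT BY) and then adds each item_id to a per-customer LinkedHashSet, which drops duplicates while keeping first-insertion order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CollectSetOrderSketch {

    // Hypothetical row type standing in for (customer_id, item_id, item_rank).
    static final class Row {
        final int customerId;
        final int itemId;
        final int itemRank;

        Row(int customerId, int itemId, int itemRank) {
            this.customerId = customerId;
            this.itemId = itemId;
            this.itemRank = itemRank;
        }
    }

    // Emulates "sort by customer_id, item_rank, then collect_set(item_id) per customer".
    static Map<Integer, Set<Integer>> collectSets(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.customerId)
                              .thenComparingInt(r -> r.itemRank));

        Map<Integer, Set<Integer>> result = new LinkedHashMap<>();
        for (Row r : sorted) {
            // LinkedHashSet ignores repeated values and keeps first-insertion order.
            result.computeIfAbsent(r.customerId, k -> new LinkedHashSet<>()).add(r.itemId);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same sample rows as the table1 CTE in the query above.
        List<Row> table1 = Arrays.asList(
            new Row(23, 2, 3),
            new Row(23, 2, 3),
            new Row(23, 4, 2),
            new Row(25, 5, 1),
            new Row(25, 4, 2));

        System.out.println(collectSets(table1)); // prints {23=[4, 2], 25=[5, 4]}
    }
}
```

    The sketch reproduces the sets shown in the results above. It only demonstrates the LinkedHashSet behaviour the answer relies on. It says nothing about how Hive schedules rows across reducers, so the DISTRIBUTE BY and SORT BY clauses in the sub-query are still needed.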
