Question
I have a table:
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
I would like to find the sum of costs over all products for each "key" group and append it to each row. For example, for key = 1, find the sum of the costs of all products (20+10+5=35) and then append the result to all rows with key = 1. So the end result is:
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
I would prefer to do this without using a sub-join, as this would be inefficient. My best idea would be to use the over function in conjunction with the sum function, but I can't get it to work. My best try:
SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;
I've had a look at the docs, but they're so cryptic that I have no idea how to work out how to do it. I'm using Hive v0.12.0, HDP v2.0.6, HortonWorks Hadoop distribution.
Answer 1:
Similar to @VB_'s answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame clause.
The HiveQL query is therefore:
SELECT key, product_code, cost,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;
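To see what the full-partition frame does without a Hive cluster at hand, here is a minimal sketch that runs the same window clause in SQLite through Python's sqlite3 module. SQLite is only a stand-in (an assumption on my part, and it needs SQLite 3.25 or newer for window functions); the column is assumed to be named cost as in the sample table, and the outer ORDER BY is added only to make the printed order deterministic.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); data mirrors the question's sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (key INTEGER, product_code TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

# The frame spans the whole partition, so every row sees its group's total.
full_frame_rows = conn.execute(
    """
    SELECT key, product_code, cost,
           SUM(cost) OVER (PARTITION BY key ORDER BY key
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
               AS total_costs
    FROM test
    ORDER BY key, cost DESC
    """
).fetchall()

for row in full_frame_rows:
    print(row)
```

Each printed row should carry the total of its own key group in the last position.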
Answer 2:
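You could use BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to get a running sum within each partition without a self join; for a total over the whole group, end the frame at UNBOUNDED FOLLOWING instead.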
Code as below:
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
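For what it's worth, a frame that ends at CURRENT ROW accumulates a running sum inside each partition rather than the group total. The sketch below shows this on the question's sample data, again using SQLite (3.25+) via Python's sqlite3 as a stand-in for Hive. The ORDER BY cost DESC inside the window and the running_cost alias are my own choices for illustration, not something from the answer.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); data mirrors the question's sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (key INTEGER, product_code TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

# The frame stops at the current row, so the sum grows row by row within each key.
running_rows = conn.execute(
    """
    SELECT key, product_code, cost,
           SUM(cost) OVER (PARTITION BY key ORDER BY cost DESC
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running_cost
    FROM test
    ORDER BY key, cost DESC
    """
).fetchall()

for row in running_rows:
    print(row)
```

Only the last row of each key group ends up with the full group total, which is why this frame on its own does not produce the table the question asks for.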
Answer 3:
The analytics function sum gives cumulative sums. For example, if you did:
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test
then you would get:
key product_code cost total_costs
1 UK 20 20
1 US 10 30
1 EU 5 35
2 UK 3 3
2 EU 6 9
which, it seems, is not what you want.
Instead, you should use the aggregation function sum, combined with a self join to accomplish this:
select test.key, test.product_code, test.cost, agg.total_cost
from (
select key, sum(cost) as total_cost
from test
group by key
) agg
join test
on agg.key = test.key;
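As a quick sanity check, the sketch below runs this self join next to a plain sum(cost) over (partition by key) and compares the two result sets. It uses SQLite (3.25+) through Python's sqlite3 as a stand-in for Hive, which is an assumption; Hive 0.12 may behave differently, so treat it as an illustration of standard window semantics only.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); data mirrors the question's sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (key INTEGER, product_code TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

# Aggregate per key, then join the totals back onto the detail rows.
join_rows = conn.execute(
    """
    SELECT test.key, test.product_code, test.cost, agg.total_cost
    FROM (SELECT key, SUM(cost) AS total_cost FROM test GROUP BY key) agg
    JOIN test ON agg.key = test.key
    ORDER BY test.key, test.cost DESC
    """
).fetchall()

# Window version with no ORDER BY and no explicit frame inside OVER.
window_rows = conn.execute(
    """
    SELECT key, product_code, cost,
           SUM(cost) OVER (PARTITION BY key) AS total_cost
    FROM test
    ORDER BY key, cost DESC
    """
).fetchall()

print(join_rows == window_rows)
```

In SQLite the two queries return the same rows, because a window with no ORDER BY defaults to a frame covering the whole partition.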
Answer 4:
This query gives me the perfect result:
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;
Answer 5:
A similar answer (using the Oracle emp table):
select deptno, ename, sal, sum(sal) over(partition by deptno) from emp;
The output looks like this:
deptno ename sal sum_window_0
10 MILLER 1300 8750
10 KING 5000 8750
10 CLARK 2450 8750
20 SCOTT 3000 10875
20 FORD 3000 10875
20 ADAMS 1100 10875
20 JONES 2975 10875
20 SMITH 800 10875
30 BLAKE 2850 9400
30 MARTIN 1250 9400
30 ALLEN 1600 9400
30 WARD 1250 9400
30 TURNER 1500 9400
30 JAMES 950 9400
Answer 6:
The table above looked like this:
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
The user wanted a table with the total costs, like the following:
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
Therefore, we used the following query:
SELECT key, product_code, cost,
SUM(cost) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_costs
FROM test;
So far so good. I want one more column, counting the occurrences of each country:
key product_code cost total_costs occurences
1 UK 20 35 2
1 US 10 35 1
1 EU 5 35 2
2 UK 3 9 2
2 EU 6 9 2
Therefore, I used the following query:
SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;
Sadly, this is not working; I get a cryptic error. To rule out a mistake in my query, I want to ask whether I did something wrong. Thanks.
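A few things in that query look like plausible causes, though without the actual error message this is a guess: there is no comma between the two window expressions, product code is written with a space instead of product_code, and the expected counts (UK 2, US 1, EU 2) are per country, which suggests partitioning the count by product_code rather than by key. A minimal sketch of a corrected version follows, run in SQLite (3.25+) via Python's sqlite3 as a stand-in for Hive. The column name cost and the occurrences spelling of the alias are my assumptions.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); data mirrors the question's sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (key INTEGER, product_code TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [(1, "UK", 20), (1, "US", 10), (1, "EU", 5), (2, "UK", 3), (2, "EU", 6)],
)

# Sum is windowed per key; count is windowed per product_code (i.e. per country).
count_rows = conn.execute(
    """
    SELECT key, product_code, cost,
           SUM(cost) OVER (PARTITION BY key) AS total_costs,
           COUNT(product_code) OVER (PARTITION BY product_code) AS occurrences
    FROM test
    ORDER BY key, cost DESC
    """
).fetchall()

for row in count_rows:
    print(row)
```

The two window expressions use different partitions, so each row gets the total for its key and the number of rows sharing its product_code.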
Source: https://stackoverflow.com/questions/25082057/hive-sum-over-a-specified-group-hiveql