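Window Functions: From Beginner to Expert
What exactly is a window function?
After grouping and aggregating, we sometimes still want to work with the rows as they were before aggregation. That is what a window function is for: put simply, it opens a separate window over the data and computes on that window.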
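1. Function reference
over(): specifies the data window that the analytic function works on; the size of this window can change from row to row
current row: the current row
n preceding: the n rows before the current row
n following: the n rows after the current row
unbounded: the boundary of the partition; unbounded preceding means starting from the first row, and unbounded following means ending at the last row
lag(col,n): the value of col from the n-th row before the current row
lead(col,n): the value of col from the n-th row after the current row
ntile(n): distributes the rows of an ordered partition into n numbered groups, numbered from 1; for each row, ntile returns the number of the group that row belongs to. Note: n must be of type int.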
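The clauses above can be tried out without a Hive cluster. The following is a minimal sketch, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the one-column table `t(x)` and the aliases `s`, `prev_x`, `next_x`, and `bucket` are hypothetical names used only for illustration.

```python
import sqlite3

# Hypothetical one-column table t(x), used only to illustrate the clauses.
conn = sqlite3.connect(":memory:")
conn.execute("create table t(x integer)")
conn.executemany("insert into t values (?)", [(10,), (20,), (30,), (40,), (50,)])

result = conn.execute("""
    select
        x,
        -- frame: previous row, current row, next row
        sum(x) over (order by x rows between 1 preceding and 1 following) as s,
        lag(x, 1)  over (order by x) as prev_x,   -- NULL on the first row
        lead(x, 1) over (order by x) as next_x,   -- NULL on the last row
        ntile(2)   over (order by x) as bucket    -- 5 rows -> groups of 3 and 2
    from t
    order by x
""").fetchall()

for row in result:
    print(row)
```

Each output row keeps its own `x` value next to the windowed results, which is the difference from a plain `group by`.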
2. Data preparation: the table is named business
name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
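3. Requirements
(1) Find the customers who made a purchase in April 2017, and the total number of such customers
(2) Query each customer's purchase details and the monthly purchase totals
(3) For the scenario above, accumulate cost by date
(4) Query each customer's previous purchase date
(5) Query the orders in the earliest 20% by date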
-- Create the table and load the data
create table business(
name string,
orderdate string,
cost int -- int, so that sum(cost) returns integer totals
)row format delimited fields terminated by ",";
load data local inpath "/hive/business.txt" into table business;
A quick query to sanity-check the data:
select
name ,orderdate
from business
group by name ,orderdate;
Result:
+-------+-------------+
| name  | orderdate   |
+-------+-------------+
| jack  | 2017-01-01  |
| jack  | 2017-01-05  |
| jack  | 2017-01-08  |
| jack  | 2017-02-03  |
| jack  | 2017-04-06  |
| mart  | 2017-04-08  |
| mart  | 2017-04-09  |
| mart  | 2017-04-11  |
| mart  | 2017-04-13  |
| neil  | 2017-05-10  |
| neil  | 2017-06-12  |
| tony  | 2017-01-02  |
| tony  | 2017-01-04  |
| tony  | 2017-01-07  |
+-------+-------------+
(1) Find the customers who made a purchase in April 2017, and the total number of such customers
select
name,
count(name) over() as counts
from
(
select
name
from
(
select
name ,orderdate
from business
group by name ,orderdate
having substr(orderdate,1,7)="2017-04"
)t1
group by name
)t2;
Result:
+-------+-----------------+
| name  | count_window_0  |
+-------+-----------------+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+
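The same result can be reached with a single subquery, because `count(*) over ()` counts every row that survives the inner query. This is a minimal sketch, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the variable names `ROWS`, `conn`, and `result` are illustrative, and `select distinct` replaces the two nested `group by` steps.

```python
import sqlite3

# Same sample data as the business table in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table business(name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)", ROWS)

result = conn.execute("""
    select name, count(*) over () as counts
    from (
        -- one row per customer who bought something in April 2017
        select distinct name
        from business
        where substr(orderdate, 1, 7) = '2017-04'
    ) t
    order by name
""").fetchall()

print(result)
```

The empty `over ()` makes the window span the whole result set, so every customer row carries the same total.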
(2) Query each customer's purchase details and the monthly purchase totals
select
name,collect_list(a)
from
(
select
name ,concat(orderdate,",",cost)as a
from
business
)t1
group by name;
Customer purchase details, result:
+-------+------------------------------------------------------------------------------------+
| name  | _c1                                                                                |
+-------+------------------------------------------------------------------------------------+
| jack  | ["2017-01-01,10","2017-02-03,23","2017-01-05,46","2017-04-06,42","2017-01-08,55"]  |
| mart  | ["2017-04-08,62","2017-04-09,68","2017-04-11,75","2017-04-13,94"]                  |
| neil  | ["2017-05-10,12","2017-06-12,80"]                                                  |
| tony  | ["2017-01-02,15","2017-01-04,29","2017-01-07,50"]                                  |
+-------+------------------------------------------------------------------------------------+
Purchase total per month (all customers):
select
name,
month1,
cost,
sum(cost) over(partition by month1)
from
(
select
name,month(orderdate)as month1,cost
from business
)t1
order by month1 ,name;
Purchase total per month, result:
+-------+---------+-------+---------------+
| name  | month1  | cost  | sum_window_0  |
+-------+---------+-------+---------------+
| jack  | 1       | 10    | 205           |
| jack  | 1       | 55    | 205           |
| jack  | 1       | 46    | 205           |
| tony  | 1       | 50    | 205           |
| tony  | 1       | 29    | 205           |
| tony  | 1       | 15    | 205           |
| jack  | 2       | 23    | 23            |
| jack  | 4       | 42    | 341           |
| mart  | 4       | 94    | 341           |
| mart  | 4       | 75    | 341           |
| mart  | 4       | 68    | 341           |
| mart  | 4       | 62    | 341           |
| neil  | 5       | 12    | 12            |
| neil  | 6       | 80    | 80            |
+-------+---------+-------+---------------+
Monthly purchase total per customer:
select
name,
month1,
cost,
sum(cost) over(partition by name,month1)
from
(
select
name,month(orderdate)as month1,cost
from business
)t1
order by month1 ,name;
Result:
+-------+---------+-------+---------------+
| name  | month1  | cost  | sum_window_0  |
+-------+---------+-------+---------------+
| jack  | 1       | 55    | 111           |
| jack  | 1       | 10    | 111           |
| jack  | 1       | 46    | 111           |
| tony  | 1       | 50    | 94            |
| tony  | 1       | 15    | 94            |
| tony  | 1       | 29    | 94            |
| jack  | 2       | 23    | 23            |
| jack  | 4       | 42    | 42            |
| mart  | 4       | 68    | 299           |
| mart  | 4       | 62    | 299           |
| mart  | 4       | 94    | 299           |
| mart  | 4       | 75    | 299           |
| neil  | 5       | 12    | 12            |
| neil  | 6       | 80    | 80            |
+-------+---------+-------+---------------+
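One thing to keep in mind: month(orderdate) keeps only the month number, so if the data spanned several years, the same month of different years would land in one partition. The sample data is all from 2017, so the totals here are unaffected. A minimal sketch of a year-aware variant follows, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the `ym` and `month_total` aliases and the `totals` dictionary are illustrative names.

```python
import sqlite3

# Same sample data as the business table in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table business(name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)", ROWS)

result = conn.execute("""
    select name,
           substr(orderdate, 1, 7) as ym,      -- 'yyyy-MM' keeps the year
           cost,
           sum(cost) over (partition by name, substr(orderdate, 1, 7)) as month_total
    from business
    order by name, orderdate
""").fetchall()

# Collapse to one total per (customer, year-month) for easy inspection.
totals = {(name, ym): total for name, ym, _cost, total in result}
for key in sorted(totals):
    print(key, totals[key])
```

The detail rows are all still present in `result`; only the `totals` dictionary collapses them.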
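(3) For the scenario above, accumulate cost by date (window function)
Frame clauses you need to know for this requirement:
first row of the partition: unbounded preceding
current row: current row
n rows before: n preceding
n rows after: n following
last row of the partition: unbounded following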
select
*,
sum(cost) over( ),-- total over all rows
sum(cost) over(partition by name ),-- total per customer
sum(cost) over(partition by name order by orderdate desc),-- running total per customer, newest order first
sum(cost) over(partition by name order by orderdate desc rows between unbounded preceding and current row),
-- running total per customer, from the first row of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 preceding and current row)
-- per customer, ordered by date: the previous row plus the current row
from
business;
Result (the output shows only the first, second, and fifth sum columns):
+---------------+---------------------+---------------+------+------+---------------+
| business.name | business.orderdate  | business.cost | _c1  | _c2  | sum_window_2  |
+---------------+---------------------+---------------+------+------+---------------+
| jack          | 2017-01-01          | 10            | 661  | 176  | 10            |
| jack          | 2017-01-05          | 46            | 661  | 176  | 56            |
| jack          | 2017-01-08          | 55            | 661  | 176  | 101           |
| jack          | 2017-02-03          | 23            | 661  | 176  | 78            |
| jack          | 2017-04-06          | 42            | 661  | 176  | 65            |
| mart          | 2017-04-08          | 62            | 661  | 299  | 62            |
| mart          | 2017-04-09          | 68            | 661  | 299  | 130           |
| mart          | 2017-04-11          | 75            | 661  | 299  | 143           |
| mart          | 2017-04-13          | 94            | 661  | 299  | 169           |
| neil          | 2017-05-10          | 12            | 661  | 92   | 12            |
| neil          | 2017-06-12          | 80            | 661  | 92   | 92            |
| tony          | 2017-01-02          | 15            | 661  | 94   | 15            |
| tony          | 2017-01-04          | 29            | 661  | 94   | 44            |
| tony          | 2017-01-07          | 50            | 661  | 94   | 79            |
+---------------+---------------------+---------------+------+------+---------------+
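To see how the frame changes the sum, it helps to compare the variants side by side for a single customer. When order by is present and no frame is written, the default frame is range between unbounded preceding and current row, which gives a running total. The sketch below assumes SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the aliases `running_asc`, `running_desc`, and `last_two` are illustrative names.

```python
import sqlite3

# Same sample data as the business table in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table business(name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)", ROWS)

result = conn.execute("""
    select orderdate,
           cost,
           -- default frame: running total, oldest order first
           sum(cost) over (partition by name order by orderdate) as running_asc,
           -- default frame: running total, newest order first
           sum(cost) over (partition by name order by orderdate desc) as running_desc,
           -- explicit frame: previous row plus current row
           sum(cost) over (partition by name order by orderdate
                           rows between 1 preceding and current row) as last_two
    from business
    where name = 'jack'
    order by orderdate
""").fetchall()

for row in result:
    print(row)
```

The `last_two` column reproduces jack's values in the last column of the result table above (10, 56, 101, 78, 65).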
(4) Query each customer's previous purchase date
Function to know: lag(). The optional third argument is the default value returned when there is no previous row.
select
*,
lag(cost) over(partition by name order by orderdate),
lag(orderdate,1,"first purchase") over(partition by name order by orderdate)
from
business;
+---------------+---------------------+---------------+---------------+----------------+
| business.name | business.orderdate  | business.cost | lag_window_0  | lag_window_1   |
+---------------+---------------------+---------------+---------------+----------------+
| jack          | 2017-01-01          | 10            | NULL          | first purchase |
| jack          | 2017-01-05          | 46            | 10            | 2017-01-01     |
| jack          | 2017-01-08          | 55            | 46            | 2017-01-05     |
| jack          | 2017-02-03          | 23            | 55            | 2017-01-08     |
| jack          | 2017-04-06          | 42            | 23            | 2017-02-03     |
| mart          | 2017-04-08          | 62            | NULL          | first purchase |
| mart          | 2017-04-09          | 68            | 62            | 2017-04-08     |
| mart          | 2017-04-11          | 75            | 68            | 2017-04-09     |
| mart          | 2017-04-13          | 94            | 75            | 2017-04-11     |
| neil          | 2017-05-10          | 12            | NULL          | first purchase |
| neil          | 2017-06-12          | 80            | 12            | 2017-05-10     |
| tony          | 2017-01-02          | 15            | NULL          | first purchase |
| tony          | 2017-01-04          | 29            | 15            | 2017-01-02     |
| tony          | 2017-01-07          | 50            | 29            | 2017-01-04     |
+---------------+---------------------+---------------+---------------+----------------+
desc function lag; -- shows how the lag function is used
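lead() is the mirror image of lag(): it looks forward instead of backward, so the same pattern answers "when is this customer's next purchase?". A minimal sketch for one customer, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the default value 'none' and the aliases `prev_date` and `next_date` are illustrative choices.

```python
import sqlite3

# Same sample data as the business table in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table business(name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)", ROWS)

result = conn.execute("""
    select orderdate,
           -- previous purchase date, or 'none' on the customer's first order
           lag(orderdate, 1, 'none')  over (partition by name order by orderdate) as prev_date,
           -- next purchase date, or 'none' on the customer's last order
           lead(orderdate, 1, 'none') over (partition by name order by orderdate) as next_date
    from business
    where name = 'tony'
    order by orderdate
""").fetchall()

for row in result:
    print(row)
```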
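(5) Query the orders in the earliest 20% by date
Function to know: ntile() distributes the rows of a partition into the specified number of groups. The groups are numbered starting from 1, and for each row ntile returns the number of the group that row belongs to. Note: n must be of type int.
Passing ntile(5) splits your data, from first to last, into 5 numbered parts.
First split the data into 5 parts: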
select
*,
ntile(5) over (order by orderdate)
from
business;
Result:
+---------------+---------------------+---------------+-----------------+
| business.name | business.orderdate  | business.cost | ntile_window_0  |
+---------------+---------------------+---------------+-----------------+
| jack          | 2017-01-01          | 10            | 1               |
| tony          | 2017-01-02          | 15            | 1               |
| tony          | 2017-01-04          | 29            | 1               |
| jack          | 2017-01-05          | 46            | 2               |
| tony          | 2017-01-07          | 50            | 2               |
| jack          | 2017-01-08          | 55            | 2               |
| jack          | 2017-02-03          | 23            | 3               |
| jack          | 2017-04-06          | 42            | 3               |
| mart          | 2017-04-08          | 62            | 3               |
| mart          | 2017-04-09          | 68            | 4               |
| mart          | 2017-04-11          | 75            | 4               |
| mart          | 2017-04-13          | 94            | 4               |
| neil          | 2017-05-10          | 12            | 5               |
| neil          | 2017-06-12          | 80            | 5               |
+---------------+---------------------+---------------+-----------------+
Then keep only the first part, which is the earliest 20% of orders by date:
select
*
from
(
select
*,
ntile(5) over (order by orderdate) n
from
business
)t1
where n=1;
Result:
+----------+---------------+----------+-------+
| t1.name  | t1.orderdate  | t1.cost  | t1.n  |
+----------+---------------+----------+-------+
| jack     | 2017-01-01    | 10       | 1     |
| tony     | 2017-01-02    | 15       | 1     |
| tony     | 2017-01-04    | 29       | 1     |
+----------+---------------+----------+-------+
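Note that 14 rows do not divide evenly by 5, so the groups are not all the same size: the earlier groups get the extra rows. The sketch below counts the rows per group and then selects group 1. It assumes SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; the variable names `sizes` and `first` are illustrative.

```python
import sqlite3

# Same sample data as the business table in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table business(name text, orderdate text, cost integer)")
conn.executemany("insert into business values (?, ?, ?)", ROWS)

# Number of rows that ntile(5) assigns to each group.
sizes = conn.execute("""
    select n, count(*)
    from (
        select ntile(5) over (order by orderdate) as n
        from business
    ) t
    group by n
    order by n
""").fetchall()

# Rows in group 1, i.e. the earliest fifth of the orders.
first = conn.execute("""
    select name, orderdate, cost
    from (
        select *, ntile(5) over (order by orderdate) as n
        from business
    ) t
    where n = 1
    order by orderdate
""").fetchall()

print(sizes)
print(first)
```

Because of the uneven split, "the first 20%" here means 3 of 14 rows, slightly more than a strict 20%.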
Once you have worked through this set of exercises, you have basically got the hang of window functions.
Source: CSDN
Author: 北京小峻
Link: https://blog.csdn.net/weixin_45896475/article/details/103878746