SparkSQL | Window Functions

Posted by 佐手、 on 2020-03-02 07:42:30

The definition of a window function, borrowing a well-known one: a window function calculates a return value for every input row of a table based on a group of rows. How window functions differ from other functions:

  • Ordinary functions: operate on each record and compute a new column (the row count is unchanged);
  • Aggregate functions: operate on a group of records (the data is split into groups in some way) and compute one aggregated value per group (the row count shrinks);
  • Window functions: operate on each record, and for each record pick out a set of related records to compute a value over (the row count is unchanged).

Window function syntax: function_name(args) OVER (PARTITION BY clause ORDER BY clause ROWS/RANGE clause)

  • Function name: the window function to apply (e.g. row_number, rank, sum);
  • OVER: keyword marking this as a window function rather than an ordinary aggregate;
  • Clauses:
    • PARTITION BY: the grouping column(s)
    • ORDER BY: the sorting column(s)
    • ROWS/RANGE frame clause: controls the size and boundaries of the window; there are two kinds (ROWS, RANGE)
      • ROWS: physical window; rows are selected by their index after sorting
      • RANGE: logical window; rows are selected by their value

There are three main kinds of window functions:

  • ranking functions
  • analytic functions
  • aggregate functions

Loading the data

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

# `spark` already exists in the pyspark shell; otherwise create a session first
spark = SparkSession.builder.appName('window-functions').getOrCreate()

schema = (StructType()
          .add('name', StringType(), True)
          .add('department', StringType(), True)
          .add('salary', IntegerType(), True))
df = spark.createDataFrame([
    ("Tom", "Sales", 4500),
    ("Georgi", "Sales", 4200),
    ("Kyoichi", "Sales", 3000),    
    ("Berni", "Sales", 4700),
    ("Guoxiang", "Sales", 4200),   
    ("Parto", "Finance", 2700),
    ("Anneke", "Finance", 3300),
    ("Sumant", "Finance", 3900),
    ("Jeff", "Marketing", 3100),
    ("Patricio", "Marketing", 2500)
], schema=schema)
df.createOrReplaceTempView('salary')
df.show()

+--------+----------+------+
|    name|department|salary|
+--------+----------+------+
|     Tom|     Sales|  4500|
|  Georgi|     Sales|  4200|
| Kyoichi|     Sales|  3000|
|   Berni|     Sales|  4700|
|Guoxiang|     Sales|  4200|
|   Parto|   Finance|  2700|
|  Anneke|   Finance|  3300|
|  Sumant|   Finance|  3900|
|    Jeff| Marketing|  3100|
|Patricio| Marketing|  2500|
+--------+----------+------+

ranking functions

sql           DataFrame     Description
row_number    rowNumber     Unique sequential number from 1 to n within the partition
rank          rank          Tied values get the same rank, and the following ranks are skipped (1, 2, 2, 4)
dense_rank    denseRank     Like rank, but with no gaps after ties (1, 2, 2, 3)
percent_rank  percentRank   (rank within partition - 1) / (rows in partition - 1); 0 if the partition has only one row
ntile         ntile         Sorts the partition and splits it into n buckets; returns the current row's bucket number (starting at 1)
spark.sql("""
SELECT
    name 
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,rank() over(partition by department order by salary) as rank
    ,dense_rank() over(partition by department order by salary) as dense_rank
    ,percent_rank() over(partition by department order by salary) as percent_rank
    ,ntile(2) over(partition by department order by salary) as ntile
FROM salary
""").show()
+--------+----------+------+-----+----+----------+------------+-----+
|    name|department|salary|index|rank|dense_rank|percent_rank|ntile|
+--------+----------+------+-----+----+----------+------------+-----+
|Patricio| Marketing|  2500|    1|   1|         1|         0.0|    1|
|    Jeff| Marketing|  3100|    2|   2|         2|         1.0|    2|
| Kyoichi|     Sales|  3000|    1|   1|         1|         0.0|    1|
|  Georgi|     Sales|  4200|    2|   2|         2|        0.25|    1|
|Guoxiang|     Sales|  4200|    3|   2|         2|        0.25|    1|
|     Tom|     Sales|  4500|    4|   4|         3|        0.75|    2|
|   Berni|     Sales|  4700|    5|   5|         4|         1.0|    2|
|   Parto|   Finance|  2700|    1|   1|         1|         0.0|    1|
|  Anneke|   Finance|  3300|    2|   2|         2|         0.5|    1|
|  Sumant|   Finance|  3900|    3|   3|         3|         1.0|    2|
+--------+----------+------+-----+----+----------+------------+-----+

analytic functions

sql        DataFrame  Description
cume_dist  cumeDist   (rows with a value <= the current row's value) / (total rows in the partition)
lag        lag        lag(input[, offset[, default]]): the value of input offset rows before the current row; returns default (null if not given) when that row does not exist
lead       lead       The opposite of lag: the value of input offset rows after the current row
spark.sql("""
SELECT
    name 
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,cume_dist() over(partition by department order by salary) as cume_dist
    ,lag(salary, 2) over(partition by department order by salary) as lag
    ,lead(salary, 2) over(partition by department order by salary) as lead
FROM salary
""").show()
+--------+----------+------+-----+------------------+----+----+
|    name|department|salary|index|         cume_dist| lag|lead|
+--------+----------+------+-----+------------------+----+----+
|Patricio| Marketing|  2500|    1|               0.5|null|null|
|    Jeff| Marketing|  3100|    2|               1.0|null|null|
| Kyoichi|     Sales|  3000|    1|               0.2|null|4200|
|  Georgi|     Sales|  4200|    2|               0.6|null|4500|
|Guoxiang|     Sales|  4200|    3|               0.6|3000|4700|
|     Tom|     Sales|  4500|    4|               0.8|4200|null|
|   Berni|     Sales|  4700|    5|               1.0|4200|null|
|   Parto|   Finance|  2700|    1|0.3333333333333333|null|3900|
|  Anneke|   Finance|  3300|    2|0.6666666666666666|null|null|
|  Sumant|   Finance|  3900|    3|               1.0|2700|null|
+--------+----------+------+-----+------------------+----+----+

aggregate functions

These simply apply the ordinary aggregate functions within a window.

sql  Description
avg  average
sum  total
min  minimum
max  maximum
spark.sql("""
SELECT
    name 
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,sum(salary) over(partition by department order by salary) as sum
    ,avg(salary) over(partition by department order by salary) as avg
    ,min(salary) over(partition by department order by salary) as min
    ,max(salary) over(partition by department order by salary) as max    
FROM salary
""").show()
+--------+----------+------+-----+-----+------+----+----+
|    name|department|salary|index|  sum|   avg| min| max|
+--------+----------+------+-----+-----+------+----+----+
|Patricio| Marketing|  2500|    1| 2500|2500.0|2500|2500|
|    Jeff| Marketing|  3100|    2| 5600|2800.0|2500|3100|
| Kyoichi|     Sales|  3000|    1| 3000|3000.0|3000|3000|
|  Georgi|     Sales|  4200|    2|11400|3800.0|3000|4200|
|Guoxiang|     Sales|  4200|    3|11400|3800.0|3000|4200|
|     Tom|     Sales|  4500|    4|15900|3975.0|3000|4500|
|   Berni|     Sales|  4700|    5|20600|4120.0|3000|4700|
|   Parto|   Finance|  2700|    1| 2700|2700.0|2700|2700|
|  Anneke|   Finance|  3300|    2| 6000|3000.0|2700|3300|
|  Sumant|   Finance|  3900|    3| 9900|3300.0|2700|3900|
+--------+----------+------+-----+-----+------+----+----+

Window frame clause

ROWS/RANGE frame clause: controls the size and boundaries of the window; there are two kinds (ROWS, RANGE)

  • ROWS: physical window; rows are selected by their index after sorting
  • RANGE: logical window; rows are selected by their value

Syntax: OVER (PARTITION BY … ORDER BY … frame_type BETWEEN start AND end)

There are five kinds of boundaries:

  • CURRENT ROW: the current row
  • UNBOUNDED PRECEDING: the first row of the partition
  • UNBOUNDED FOLLOWING: the last row of the partition
  • n PRECEDING: the n rows before the current row
  • n FOLLOWING: the n rows after the current row
spark.sql("""
SELECT
    name 
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,row_number() over(partition by department order by salary rows between UNBOUNDED PRECEDING and CURRENT ROW) as index1
FROM salary
""").show()
+--------+----------+------+-----+------+
|    name|department|salary|index|index1|
+--------+----------+------+-----+------+
|Patricio| Marketing|  2500|    1|     1|
|    Jeff| Marketing|  3100|    2|     2|
| Kyoichi|     Sales|  3000|    1|     1|
|  Georgi|     Sales|  4200|    2|     2|
|Guoxiang|     Sales|  4200|    3|     3|
|     Tom|     Sales|  4500|    4|     4|
|   Berni|     Sales|  4700|    5|     5|
|   Parto|   Finance|  2700|    1|     1|
|  Anneke|   Finance|  3300|    2|     2|
|  Sumant|   Finance|  3900|    3|     3|
+--------+----------+------+-----+------+

Combined usage

spark.sql("""
SELECT
    name 
    ,department
    ,salary
    ,row_number() over(partition by department order by salary) as index
    ,salary - (min(salary) over(partition by department order by salary)) as salary_diff 
FROM salary
""").show()
+--------+----------+------+-----+-----------+
|    name|department|salary|index|salary_diff|
+--------+----------+------+-----+-----------+
|Patricio| Marketing|  2500|    1|          0|
|    Jeff| Marketing|  3100|    2|        600|
| Kyoichi|     Sales|  3000|    1|          0|
|  Georgi|     Sales|  4200|    2|       1200|
|Guoxiang|     Sales|  4200|    3|       1200|
|     Tom|     Sales|  4500|    4|       1500|
|   Berni|     Sales|  4700|    5|       1700|
|   Parto|   Finance|  2700|    1|          0|
|  Anneke|   Finance|  3300|    2|        600|
|  Sumant|   Finance|  3900|    3|       1200|
+--------+----------+------+-----+-----------+

TODO

  • Flesh out the ROWS/RANGE frame clause
  • Add more practical examples
