Hive Functions

Official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Hive built-in functions

1. Viewing built-in functions

-- List all functions
show functions;
-- Show a function's description
desc function abs;
-- Show a function's extended information
desc function extended concat;

2. Quick ways to test a built-in function

Call it directly

hive> select concat('aa','bb');
OK
aabb
Time taken: 0.058 seconds, Fetched: 1 row(s)

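Call it through a table

-- 1) Create a dual table
hive> create table dual(id string);
-- 2) Load a file containing a single line with one space into the dual table
-- 3) Query through the table
select substr('huangbo',2,3) from dual;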

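3. Built-in function list

3.1 Relational operators:

1. Equality: =
2. NULL-safe equality: <=>
3. Inequality: <> and !=
4. Less than: <
5. Less than or equal to: <=
6. Greater than: >
7. Greater than or equal to: >=
8. Range comparison: BETWEEN
9. NULL check: IS NULL
10. Non-NULL check: IS NOT NULL
11. Pattern match: LIKE
12. Java regular-expression match: RLIKE
13. Regular-expression match: REGEXP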
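
The difference between = and <=> is how they treat NULL: = yields NULL when either operand is NULL, while <=> yields true when both are NULL and false when only one is. The sketch below models that rule in Python, with None standing in for NULL. The function names are hypothetical and this is only a model of the semantics, not Hive code.

```python
def sql_eq(a, b):
    """Model of `a = b`: the result is NULL (None) when either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_null_safe_eq(a, b):
    """Model of `a <=> b`: NULL is treated as an ordinary, comparable value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


# Print both results side by side for a few operand pairs
for a, b in [(1, 1), (1, None), (None, None)]:
    print(a, b, sql_eq(a, b), sql_null_safe_eq(a, b))
```

This is why <=> is useful in join conditions on columns that may contain NULL.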

3.2 Arithmetic operators:

1. Addition: +
2. Subtraction: -
3. Multiplication: *
4. Division: /
5. Remainder: %
6. Bitwise AND: &
7. Bitwise OR: |
8. Bitwise XOR: ^
9. Bitwise NOT: ~

3.3 Logical operators:

1. Logical AND: AND, &&
2. Logical OR: OR, || (from Hive 2.2.0, || is the string concatenation operator)
3. Logical NOT: NOT, !

3.4 Complex type constructors

1. map
2. struct
3. named_struct
4. array
5. create_union

3.5 Operators on complex types

1. Array element access: A[n]
2. Map element access: M[key]
3. Struct field access: S.x

3.6 Numeric functions

1. Rounding: round
2. Rounding to a given precision: round
3. Round down: floor
4. Round up: ceil
5. Round up: ceiling
6. Random number: rand
7. Natural exponential: exp
8. Base-10 logarithm: log10
9. Base-2 logarithm: log2
10. Logarithm with a given base: log
11. Power: pow
12. Power: power
13. Square root: sqrt
14. Binary representation: bin
15. Hexadecimal representation: hex
16. Inverse of hex: unhex
17. Base conversion: conv
18. Absolute value: abs
19. Positive modulo: pmod
20. Sine: sin
21. Arcsine: asin
22. Cosine: cos
23. Arccosine: acos
24. positive function: positive
25. negative function: negative

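3.7 Collection functions

1. Size of a map: size
2. Size of an array: size
3. Test whether an array contains an element: array_contains
4. All values of a map: map_values
5. All keys of a map: map_keys
6. Sort an array: sort_array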

3.8 Type conversion functions

1. Conversion to binary: binary
2. Explicit conversion between primitive types: cast

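3.9 Date functions

1. UNIX timestamp to date string: from_unixtime
2. Current UNIX timestamp: unix_timestamp
3. Date to UNIX timestamp: unix_timestamp
4. Date in a given format to UNIX timestamp: unix_timestamp
5. Datetime to date: to_date
6. Year of a date: year
7. Month of a date: month
8. Day of a date: day
9. Hour of a date: hour
10. Minute of a date: minute
11. Second of a date: second
12. Week of the year of a date: weekofyear
13. Difference between two dates in days: datediff
14. Add days to a date: date_add
15. Subtract days from a date: date_sub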

3.10 Conditional functions

1. If function: if
2. First non-NULL value: COALESCE
3. Conditional expression: CASE

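3.11 String functions

1. ASCII code of a character: ascii
2. Base64 encoding: base64
3. String concatenation: concat
4. String concatenation with a separator: concat_ws
5. Join an array into a string: concat_ws
6. Format a number with decimal places as a string: format_number
7. Substring from a start position: substr, substring
8. Substring of a given length: substr, substring
9. Find a substring: instr
10. String length: length
11. Find a substring: locate
12. Formatted string: printf
13. String to map: str_to_map
14. Base64 decoding: unbase64(string str)
15. To upper case: upper, ucase
16. To lower case: lower, lcase
17. Trim spaces on both sides: trim
18. Trim leading spaces: ltrim
19. Trim trailing spaces: rtrim
20. Regular-expression replace: regexp_replace
21. Regular-expression extract: regexp_extract
22. URL parsing: parse_url
23. JSON parsing: get_json_object
24. String of spaces: space
25. Repeat a string: repeat
26. Left pad: lpad
27. Right pad: rpad
28. Split a string: split
29. Find in a comma-separated set: find_in_set
30. Tokenization: sentences
31. Top-K most frequent n-grams after tokenization: ngrams
32. Top-K most frequent n-grams appearing with given words after tokenization: context_ngrams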

3.12 Miscellaneous functions

1. Call a Java method: java_method
2. Call a Java method: reflect
3. Hash value of a string: hash

3.13 XPath functions for parsing XML

1. xpath
2. xpath_string
3. xpath_boolean
4. xpath_short, xpath_int, xpath_long
5. xpath_float, xpath_double, xpath_number

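3.14 Aggregate functions (UDAF)

1. Count: count
2. Sum: sum
3. Average: avg
4. Minimum: min
5. Maximum: max
6. Population variance: var_pop
7. Sample variance: var_samp
8. Population standard deviation: stddev_pop
9. Sample standard deviation: stddev_samp
10. Exact percentile (median) for a single p: percentile
11. Exact percentiles for an array of p values: percentile
12. Approximate percentile (median) for a single p: percentile_approx
13. Approximate percentiles for an array of p values: percentile_approx
14. Histogram: histogram_numeric
15. Collect into a set (duplicates removed): collect_set
16. Collect into a list (duplicates kept): collect_list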

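3.15 Table-Generating Functions (UDTF)

1. Split an array into multiple rows: explode(array)
2. Split a map into multiple rows: explode(map)

explode follows a simple rule: each element produces one row, so a collection with n elements produces n rows.
Example: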

hive> create table test_map (id int,name string,scores map<string,int>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
OK
Time taken: 0.108 seconds
-- Data (contents of the local file score)
1	liujialing	yw:85,sx:45,yy:56
2	huanglei	yw:76,sx:56,yy:78
3	huangjiaju	yw:85,sx:34,yy:78,ty:90
4	liutao	yw:48,sx:23,yy:10
-- Load the data
hive> load data local inpath 'score' into table test_map;
Loading data to table default.test_map
OK
Time taken: 0.424 seconds
hive> select * from test_map;
OK
1	liujialing	{"yw":85,"sx":45,"yy":56}
2	huanglei	{"yw":76,"sx":56,"yy":78}
3	huangjiaju	{"yw":85,"sx":34,"yy":78,"ty":90}
4	liutao	{"yw":48,"sx":23,"yy":10}
Time taken: 0.852 seconds, Fetched: 4 row(s)
-- Exploding only the array or map column works fine
hive> select explode(scores) from test_map;
OK
yw	85
sx	45
yy	56
yw	76
sx	56
yy	78
yw	85
sx	34
yy	78
ty	90
yw	48
sx	23
yy	10

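However, selecting explode together with ordinary columns raises an error, because a UDTF cannot be mixed with other expressions in the select list.
The way around this is lateral view:
1) lateral view explode(scores) stc stores the result of explode as a view named stc, whose rows are joined back to the source row they came from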

hive> select id,stc.* from test_map lateral view explode(scores) stc;
OK
1	yw	85
1	sx	45
1	yy	56
2	yw	76
2	sx	56
2	yy	78
3	yw	85
3	sx	34
3	yy	78
3	ty	90
4	yw	48
4	sx	23
4	yy	10
Time taken: 0.074 seconds, Fetched: 13 row(s)

2) To select a single column of the view, name the view's columns with as

hive> select
    > id,
    > stc.mv
    > from test_map
    > lateral view explode(scores) stc as mk,mv;
OK
1	85
1	45
1	56
2	76
2	56
2	78
3	85
3	34
3	78
3	90
4	48
4	23
4	10
Time taken: 0.05 seconds, Fetched: 13 row(s)
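
The row expansion performed by lateral view explode can be modeled outside Hive. The following is a minimal Python sketch of the semantics only, not of Hive's implementation; the function name lateral_view_explode is hypothetical, and the input rows are two of the test_map rows shown above.

```python
def lateral_view_explode(rows):
    """Yield one (id, key, value) row per map entry, mirroring
    `select id, stc.* from test_map lateral view explode(scores) stc`."""
    for row_id, _name, scores in rows:
        # Each map entry becomes its own output row, tagged with the source id
        for subject, score in scores.items():
            yield (row_id, subject, score)


# Two of the rows from test_map
test_map = [
    (1, "liujialing", {"yw": 85, "sx": 45, "yy": 56}),
    (3, "huangjiaju", {"yw": 85, "sx": 34, "yy": 78, "ty": 90}),
]

exploded = list(lateral_view_explode(test_map))
for row in exploded:
    print(*row, sep="\t")
```

In this model a row whose map is empty produces no output rows, which is also how a plain lateral view behaves; Hive's lateral view outer keeps such rows and fills the exploded columns with NULL.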

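Hive user-defined functions (UDF)

When Hive's built-in functions cannot meet a business requirement, consider writing a user-defined function. There are three kinds:
UDF (user-defined function): operates on a single row and produces a single row as output (math functions, string functions).
UDAF (user-defined aggregation function): takes multiple input rows and produces one output row (count, max).
UDTF (user-defined table-generating function): takes one input row and outputs multiple rows (explode).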
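
The three kinds differ only in how many rows go in and how many come out. A rough Python sketch of those shapes follows; the function names are hypothetical and the code models the row counts only, not the Hive Java interfaces.

```python
def udf_lower(value):
    """UDF shape: one input row -> one output row."""
    return None if value is None else value.lower()


def udaf_max(values):
    """UDAF shape: many input rows -> one output row (NULLs are skipped)."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def udtf_explode(items):
    """UDTF shape: one input row -> zero or more output rows."""
    for item in items:
        yield item


print(udf_lower("HERO"))
print(udaf_max([85, None, 90]))
print(list(udtf_explode(["yw", "sx", "yy"])))
```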

1. Steps for writing a custom function

Add the dependency

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.3.6</version>
</dependency>

Create a class that extends org.apache.hadoop.hive.ql.exec.UDF and define an evaluate method

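package com.study.follow.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple UDF that converts a string to lower case
public class ToLowerCase extends UDF {
	// Hive calls evaluate() once per row; a NULL column value arrives as null
	public String evaluate(String field){
		if (field == null) {
			return null;
		}
		return field.toLowerCase();
	}

}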

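Package the class as a jar and upload it to the server
Add the jar to Hive's classpath (with the add jar command)

Create a temporary function and associate it with the compiled class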

hive> create temporary function tolowercase as 'com.study.follow.udf.ToLowerCase';

The custom function can now be used in HQL
```
hive> select tolowercase("HERO");
OK
hero
```

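Note:
The steps above create a temporary function. A temporary function is valid only for the current client session and is unregistered when the client exits,
so after restarting the client you must add the jar and create the temporary function again.
Permanent functions also exist but are rarely used, because a custom function usually targets a specific scenario and is not called often.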

2. Developing a UDF for parsing JSON data

The raw JSON data (rating.json) looks like this:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

The data needs to be imported into the Hive warehouse, and the final result should look like this:

movie	rate	timeStamp	uid
1193	5	978300760	1

How can this be done? (Hint: use the built-in get_json_object or a custom function.)

2.1 get_json_object

First load the rating.json file into a raw Hive table, test_json

hive> create table test_json(line string);
hive> load data local inpath 'rating.json' into table test_json;
-- Parse a single JSON string
hive> select get_json_object('{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}','$.movie');
OK
1193

get_json_object(json_txt, path)

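Parameter 1: a JSON-formatted string
Parameter 2: the path of the attribute to extract from the JSON string
$ : the root, i.e. the outermost structure of the JSON string
. : child operator, selects the attribute with the given name
[] : subscript operator, selects an array element by index, starting at 0
* : wildcard, matches all elements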
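
To get a feel for how the $, . and [] parts of a path walk through a JSON document, here is a minimal Python sketch. The helper json_path_get is hypothetical: it handles only .name and [index] steps (no * wildcard), it is not Hive's implementation, and its handling of edge cases is an assumption made for illustration.

```python
import json
import re

# One path step: ".name" or "[index]"
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def json_path_get(json_txt, path):
    """Hypothetical helper: follow a '$.a[0].b' style path and return a string, or None."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_txt)
    except ValueError:
        return None  # invalid JSON
    pos = 1  # skip the leading "$" (the root)
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            return None  # unsupported path syntax
        key, index = step.group(1), step.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            i = int(index)
            if not isinstance(node, list) or i >= len(node):
                return None
            node = node[i]
        pos = step.end()
    # Results are always strings, as with get_json_object
    return node if isinstance(node, str) else json.dumps(node)


line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
print(json_path_get(line, "$.movie"))
```

The same helper applied to a nested document, for example json_path_get('{"a":[{"b":1}]}', '$.a[0].b'), shows the subscript step selecting the first array element before the child step selects b.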

Parse the test_json table

create table json_final 
as 
select
get_json_object(line,'$.movie') movie,
get_json_object(line,'$.rate') rate,
get_json_object(line,'$.timeStamp') unixtime,
get_json_object(line,'$.uid') as userid 
from test_json;
-- Note: every value returned by get_json_object is of type string

hive> select * from json_final;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
Time taken: 0.118 seconds, Fetched: 9 row(s)

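2.2 Implementation with Transform

Hive's TRANSFORM keyword lets a SQL statement call a script you wrote yourself. It suits cases where Hive lacks a feature and you do not want to write a UDF.
The example here uses transform with Python to convert unixtime to weekday.

Write the Python script: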

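## Create the script: vi weekday_mapper.py
#!/bin/python
import sys
import datetime

# Read tab-separated rows from stdin and replace unixtime with the ISO weekday (1=Monday ... 7=Sunday)
for line in sys.stdin:
	line=line.strip()
	movie,rate,unixtime,userid = line.split('\t')
	# fromtimestamp() without a tz argument uses the local time zone of the machine running the script
	weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
	# print() with a single argument works under both Python 2 and Python 3
	print('\t'.join([movie, rate, str(weekday),userid]))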
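
Because datetime.fromtimestamp uses the local time zone when no zone is given, the weekday computed by the script depends on where it runs. The sketch below is a testable variant of the same mapping that takes the zone explicitly. The names to_weekday and map_line are hypothetical, and UTC+8 is an assumed zone chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_weekday(unixtime, tz):
    """Return the ISO weekday (1=Monday ... 7=Sunday) of a Unix timestamp in time zone tz."""
    return datetime.fromtimestamp(float(unixtime), tz).isoweekday()


def map_line(line, tz):
    """Transform one tab-separated input row the way weekday_mapper.py does."""
    movie, rate, unixtime, userid = line.strip().split("\t")
    return "\t".join([movie, rate, str(to_weekday(unixtime, tz)), userid])


utc = timezone.utc
utc_plus_8 = timezone(timedelta(hours=8))  # assumed zone, for illustration only

print(map_line("1193\t5\t978300760\t1", utc))         # weekday column is 7 (Sunday)
print(map_line("1193\t5\t978300760\t1", utc_plus_8))  # weekday column is 1 (Monday)
```

The timestamp 978300760 is 2000-12-31 22:12:40 in UTC but already 2001-01-01 in UTC+8, so the same row maps to a different weekday in the two zones.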

Create a table to hold the data

create table lastjsontable(
movie int,
rate int,
weekday int,
userid int)
row format delimited
fields terminated by '\t';

Add the script file to the Hive session's resources, then run the transform over the json_final table created earlier:

hive> add file /home/hadoop/weekday_mapper.py;
hive> insert into table lastjsontable select transform(movie,rate,unixtime,userid)
using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from json_final;

Check that the data is correct

hive> select distinct(weekday) from lastjsontable;
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdp01_20200105154512_6171d658-871d-47e8-a443-050af1358462
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-01-05 15:45:13,478 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local844000100_0004
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 49192264 HDFS Write: 2287 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1
7
Time taken: 1.405 seconds, Fetched: 2 row(s)
hive> select * from lastjsontable;
OK
1193	5	1	1
661	3	1	1
914	3	1	1
3408	4	1	1
2355	5	7	1
1197	3	1	1
1287	5	1	1
2804	5	1	1
594	4	1	1
Time taken: 0.099 seconds, Fetched: 9 row(s)

Source: CSDN

Author: 霁泽Coding

Link: https://blog.csdn.net/jiajane/article/details/103837796
