Preparation
MySQL table: test_max_date(id int,name varchar(255),num int,date date)
Hive table: create table test_date_max(id int,name string,rq Date);
insert into table test_date_max values
(1,"1","2020-12-25"),
(2,"1","2020-12-28"),
(3,"2","2020-12-25"),
(4,"2","2020-12-20")
;
Requirement
Query the latest status of each person.
Logic
Each person has multiple rows; the later the date, the newer the status.
Procedure
MySQL:
SELECT id,name,date,max(date) from test_max_date group by name ORDER BY id
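A note on the MySQL statement above: it is accepted only when the ONLY_FULL_GROUP_BY SQL mode is disabled, and the id and date it returns come from an arbitrary row of each group, not necessarily the row that holds max(date). A minimal sketch of a version that returns the full latest row per name, assuming the test_max_date table described in the preparation step (the aliases t, m, and max_date are illustrative):

```sql
-- Latest row per name in MySQL, valid with ONLY_FULL_GROUP_BY enabled.
-- The subquery finds the newest date per name; the join pulls back
-- the complete rows that carry that date.
select t.id, t.name, t.num, t.date
from test_max_date t
join (
    select name, max(date) as max_date
    from test_max_date
    group by name
) m
  on t.name = m.name
 and t.date = m.max_date
order by t.id;
```

If two rows of the same name share the newest date, both are returned.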
Hive:
select name,max(rq) from test_date_max group by name;
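About the error: an earlier post already covered Hive's group by restriction.
The Hive table has id, name, and a date. id is the primary key and never repeats, while name can repeat. Grouping by name and applying max to rq effectively deduplicates name and returns the latest date within each group of rows that share a name.
It is like a company split into several departments: the departments are fixed, so to find the oldest age in each department you group the company-wide employee table by department and take the max of age.
In Hive, the select list must agree with the group by clause: every selected column has to be either a group by key or wrapped in an aggregate function.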
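The rule can be illustrated with the test_date_max table from the preparation step. This is a small sketch; the aliases max_rq and cnt are illustrative, not part of the original post:

```sql
-- Rejected by Hive: id is neither a group by key nor inside an aggregate.
-- select id, name, max(rq) from test_date_max group by name;

-- Accepted: name is the group by key, the other columns are aggregates.
select name,
       max(rq)  as max_rq,  -- latest date per name
       count(*) as cnt      -- number of rows per name
from test_date_max
group by name;
```

Adding id to the group by clause would make the first statement legal, but since id is unique every row would then form its own group, which defeats the purpose.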
To return the complete rows, there are two approaches below (each shown with its SQL, result data, and query time).
Approach 1:
select
a.*
from
test_date_max a
join
(select name,max(rq) as rq from test_date_max group by name) b
on a.rq = b.rq and a.name = b.name
a.id a.name a.rq
2 1 2020-12-28
3 2 2020-12-25
Time taken: 118.387 seconds, Fetched: 2 row(s)
Approach 2:
select
*
from(
select
*,
row_number()over(partition by name order by rq desc) rank
from
test_date_max
)tmp
where rank=1
tmp.id tmp.name tmp.rq tmp.rank
2 1 2020-12-28 1
3 2 2020-12-25 1
Time taken: 68.587 seconds, Fetched: 2 row(s)
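The two approaches agree on this sample data, but they can differ when several rows of the same name share the latest date: the join in approach 1 returns all tied rows, while row_number() keeps exactly one of them, and which one is not guaranteed. A sketch of two variants, assuming the same test_date_max table (the aliases rk and rn are illustrative):

```sql
-- Variant A: rank() gives tied rows the same number, so all rows
-- holding the latest date are kept, matching approach 1.
select id, name, rq
from (
    select id, name, rq,
           rank() over (partition by name order by rq desc) as rk
    from test_date_max
) tmp
where rk = 1;

-- Variant B: keep row_number() but add id as a tie-breaker, so the
-- single row chosen per name is deterministic.
select id, name, rq
from (
    select id, name, rq,
           row_number() over (partition by name order by rq desc, id desc) as rn
    from test_date_max
) tmp
where rn = 1;
```

Listing the columns explicitly in the outer select also keeps the helper ranking column out of the result.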
Log analysis: approach 1 launches 2 MapReduce jobs (plus a local map-join task), approach 2 launches 1 job
Approach 1:
hive (test)> select
> a.*
> from
> test_date_max a
> join
> (select name,max(rq) as rq from test_date_max group by name) b
> on a.rq = b.rq and a.name = b.name
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = admin_20210204130801_0f13ad17-7887-4a32-984d-088b5453617e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1611888254670_2374, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2374/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2374
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-02-04 13:08:27,084 Stage-2 map = 0%, reduce = 0%
2021-02-04 13:08:44,179 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.78 sec
2021-02-04 13:08:59,776 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.68 sec
MapReduce Total cumulative CPU time: 7 seconds 680 msec
Ended Job = job_1611888254670_2374
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-02-04 13:09:08 Starting to launch local task to process map join; maximum memory = 954728448
2021-02-04 13:09:09 Dump the side-table for tag: 0 with group count: 4 into file: file:/tmp/admin/a935af81-8bbe-4c2c-b2f5-d3bdaa816d9e/hive_2021-02-04_13-08-01_805_2036786040923555355-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2021-02-04 13:09:09 Uploaded 1 File to: file:/tmp/admin/a935af81-8bbe-4c2c-b2f5-d3bdaa816d9e/hive_2021-02-04_13-08-01_805_2036786040923555355-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (356 bytes)
2021-02-04 13:09:09 End of local task; Time Taken: 1.24 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1611888254670_2377, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2377/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2377
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2021-02-04 13:09:33,954 Stage-3 map = 0%, reduce = 0%
2021-02-04 13:09:59,112 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec
MapReduce Total cumulative CPU time: 3 seconds 370 msec
Ended Job = job_1611888254670_2377
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 7.68 sec HDFS Read: 7794 HDFS Write: 140 SUCCESS
Stage-Stage-3: Map: 1 Cumulative CPU: 3.37 sec HDFS Read: 5282 HDFS Write: 141 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 50 msec
OK
a.id a.name a.rq
2 1 2020-12-28
3 2 2020-12-25
Time taken: 118.387 seconds, Fetched: 2 row(s)
Approach 2:
hive (test)> select
> *
> from(
> select
> *,
> row_number()over(partition by name order by rq desc) rank
> from
> test_date_max
> )tmp
> where rank=1
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = admin_20210204130834_f1469766-42c9-48cb-9194-2cb506a5ff6a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1611888254670_2376, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2376/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2376
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-02-04 13:09:07,610 Stage-1 map = 0%, reduce = 0%
2021-02-04 13:09:25,459 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.71 sec
2021-02-04 13:09:42,161 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.97 sec
MapReduce Total cumulative CPU time: 7 seconds 970 msec
Ended Job = job_1611888254670_2376
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.97 sec HDFS Read: 10327 HDFS Write: 145 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 970 msec
OK
tmp.id tmp.name tmp.rq tmp.rank
2 1 2020-12-28 1
3 2 2020-12-25 1
Time taken: 68.587 seconds, Fetched: 2 row(s)
Job analysis
2021-02-04 16:38:57,881 Stage-1 map = 0%, reduce = 0%
2021-02-04 16:39:13,646 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 sec
2021-02-04 16:39:20,976 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.78 sec
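Hive's default execution engine here is MapReduce, so each SQL statement is compiled into MapReduce jobs. A MapReduce job has three phases: map, shuffle, and reduce.
The map phase reads the input files and emits key/value pairs; its output is partitioned, sorted, and spilled to local disk.
In the shuffle phase the reducers fetch that map output and merge-sort it.
The reduce phase reads the merged data, performs the second round of computation, and writes the result to disk.
The log above reports cumulative CPU time across all tasks. By the time map reaches 100% with reduce still at 0%, 4.81 sec of CPU has already been used, so the map side is not free: reading, deserializing, and sorting the records all cost CPU. When reduce reaches 100% the total is 8.78 sec, so the merge and the reduce computation account for roughly another 3.97 sec. On a table this small the computation itself is trivial, and most of the wall-clock time goes to job startup and scheduling.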
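To see how Hive breaks a statement into stages without running it, the statement can be prefixed with explain. A minimal sketch using the approach 2 query; the alias rn stands in for rank and is illustrative, and the commented-out engine switch assumes that Tez is installed on the cluster, which the original post does not state:

```sql
-- Print the stage plan (stage dependencies and the operators in each
-- stage) instead of executing the query.
explain
select id, name, rq
from (
    select id, name, rq,
           row_number() over (partition by name order by rq desc) as rn
    from test_date_max
) tmp
where rn = 1;

-- The deprecation warning in the logs suggests another engine.
-- Assumption: Tez is available on this cluster.
-- set hive.execution.engine=tez;
```

Running explain on the approach 1 query in the same way should show the extra stages behind its second job and the local map-join task.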
Source: oschina
Link: https://my.oschina.net/u/4383709/blog/4948114