Basic Usage of Hive

I. Creating Tables
When Hive creates a table, the default field delimiter is '\001'. If no delimiter is specified at table-creation time, the file you LOAD must also use '\001' as its delimiter;
if the file's delimiter is not '\001', the statement does not fail, but every column in query results comes back as NULL.
1. Specify the delimiter when creating the table:
create table pokes(foo int,bar string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
load data local inpath '/root/pokes.txt' into table pokes;
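To see the NULL behavior described above, here is a minimal sketch (the table name pokes_default is just illustrative): create a table without specifying a delimiter, then load the same tab-separated file into it.
hive> create table pokes_default(foo int, bar string);
hive> load data local inpath '/root/pokes.txt' into table pokes_default;
hive> select * from pokes_default;
Because the table expects '\001' but the file uses tabs, both columns come back as NULL.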
2. Replacing the delimiter
This is needed when the delimiter of a file to be imported does not match the table's delimiter, or when the delimiter of a file exported from Hive has to be replaced:
although you can specify a delimiter when creating the table, files exported with insert overwrite local directory still use the default field delimiter \001.
It usually has to be converted to another character, which can be done with the following command:
sed -e 's/\x01/\t/g' file
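For example, the export that produces such '\001'-delimited files might look like this (a sketch; the output path /tmp/pokes_out is just illustrative):
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pokes_out' SELECT foo, bar FROM pokes;
The result files under /tmp/pokes_out (e.g. 000000_0) can then be converted with the sed command above.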
 
II. DDL Operations
Create a table
hive> CREATE TABLE pokes (foo INT, bar STRING); 
Create a table partitioned by the column ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); 
Show all tables
hive> SHOW TABLES; 
Show tables matching a regular expression
hive> SHOW TABLES '.*s'; 
Add a column to a table
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT); 
Add a column with a column comment
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment'); 
Rename a table
hive> ALTER TABLE events RENAME TO 3koobecaf; 
Drop a table
hive> DROP TABLE pokes; 
DML operations
Load data from a file into a table
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 
Load local data, specifying partition information at the same time
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15'); 
Load HDFS data, specifying partition information at the same time
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15'); 
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous. 
SQL operations
Query by condition
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>'; 
Write query results to an HDFS directory
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>'; 
Write query results to a local directory
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a; 
More examples of inserting query results into tables and directories
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a; 
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a; 
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a; 
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>'; 
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a; 
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a; 
Insert aggregated results from one table into another table
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar; 
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar; 
JOIN 
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo; 
Insert from one source table into multiple tables and directories (multi-insert)
FROM src 
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100 
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200 
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300 
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300; 
Stream data through an external script (TRANSFORM)
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09'; 
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (see the Hive Tutorial for examples). 
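As a hedged sketch of the reduce-side case, modeled on the Hive Tutorial's MAP/REDUCE example and reusing the invites and events tables above (with /bin/cat as a placeholder pass-through script): the inner TRANSFORM runs in the map phase, CLUSTER BY redistributes the rows to reducers, and the outer TRANSFORM runs in the reduce phase.
hive> FROM (
        FROM invites a
        SELECT TRANSFORM(a.foo, a.bar) USING '/bin/cat' AS (oof, rab)
        CLUSTER BY oof
      ) m
      INSERT OVERWRITE TABLE events
      SELECT TRANSFORM(m.oof, m.rab) USING '/bin/cat' AS (foo, bar);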
 
III. Tables with complex types; here columns are separated by '\t' and array elements by ','
# The data file looks like this:
 huangfengxiao   beijing,shanghai,tianjin,hangzhou
 linan   changchu,chengdu,wuhan
 
 hive> create table complex(name string, work_locations array<string>)
     > ROW FORMAT DELIMITED
     > FIELDS TERMINATED BY '\t'
     > COLLECTION ITEMS TERMINATED BY ',';
 
 hive> describe complex;
 OK
 name    string
 work_locations  array<string>
 
 hive> LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/complex.txt' OVERWRITE INTO TABLE complex;
 hive> select * from complex;                                                                
 OK
 huangfengxiao   ["beijing","shanghai","tianjin","hangzhou"]
 linan   ["changchu","chengdu","wuhan"]
 Time taken: 0.125 seconds
 
 hive> select name, work_locations[0] from complex;
 MapReduce Total cumulative CPU time: 790 msec
 Ended Job = job_201301211420_0012
 MapReduce Jobs Launched: 
 Job 0: Map: 1   Cumulative CPU: 0.79 sec   HDFS Read: 296 HDFS Write: 37 SUCCESS
 Total MapReduce CPU Time Spent: 790 msec
 OK
 huangfengxiao   beijing
 linan   changchu
 Time taken: 20.703 seconds
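Array columns also work with Hive's built-in collection functions; for example, size() returns the number of elements (a small sketch against the complex table above):
hive> select name, size(work_locations) from complex;
This should return 4 for huangfengxiao and 3 for linan.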
 
IV. Partitioning
Suppose we have a table class(teacher string, student string, age int) with the following data:
Mis li huangfengxiao 20
Mis li lijie 21
Mis li dongdong 21
Mis li liqiang 21
Mis li hemeng 21
Mr xu dingding 19
Mr xu wangqiang 19
Mr xu lidong 19
Mr xu hexing 19
If we partition this class-member data by teacher:
create table classmem(student string, age int) partitioned by (teacher string);
The per-partition data files:
classmem_Misli.txt
huangfengxiao 20  
lijie 21          
dongdong 21  
liqiang 21          
hemeng 21 
classmem_MrXu.txt
dingding 19 
wangqiang 19 
lidong 19         
hexing 19   
LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_Misli.txt' INTO TABLE classmem partition (teacher = 'Mis.li');
LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_MrXu.txt' INTO TABLE classmem partition (teacher = 'Mr.Xu');
 
# The partition column is appended as the last column of the result by default
hive> select * from classmem where teacher = 'Mr.Xu';
OK
dingding        19      NULL    Mr.Xu
wangqiang       19      NULL    Mr.Xu
lidong  19              NULL    Mr.Xu
hexing  19      NULL    Mr.Xu
Time taken: 0.196 seconds
# (The NULL in the age column is the delimiter problem from Section I: the data files are space-delimited while the table uses the default '\001' delimiter, so age cannot be parsed.)
# Filtering on the partition column reads only that partition's data, which is fast; if the WHERE condition is not on a partition column, the query is compiled into a MapReduce job over the whole table and latency is much higher.
# That is why partitions are built on the columns most frequently used for filtering.
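To verify what LOAD DATA created, the partitions of the table can be listed (a small sketch):
hive> SHOW PARTITIONS classmem;
This should list teacher=Mis.li and teacher=Mr.Xu, one line per partition directory.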
 
V. Buckets: more efficient, and samplable; mainly used for sampling large data sets
Bucketing slices a table (or partition): you pick a bucketing column and a number of buckets, and each row is assigned to a bucket by hashing the column value modulo the bucket count.
For example, the data file bucket.txt contains:
id name age
1 huang 11
2 li 11
3 xu 12
4 zhong 14
5 hu 15
6 liqiang 17
7 zhonghua 19
If we split this table into 3 buckets on the id column,
then after hashing id the 3 buckets contain:
Bucket 0 (hash(id) % 3 == 0):
3 xu 12
6 liqiang 17
Bucket 1 (hash(id) % 3 == 1):
1 huang 11
4 zhong 14
7 zhonghua 19
Bucket 2 (hash(id) % 3 == 2):
2 li 11
5 hu 15
The CREATE TABLE statement for this is:
create table bucketmem (id int, name string, age int) CLUSTERED BY (id) SORTED BY (id ASC) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
 
LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/bucketmem.txt' INTO TABLE bucketmem;
select * from bucketmem tablesample(bucket 1 out of 3);
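Note that LOAD DATA only copies files into the table's directory; it does not hash rows into buckets by itself. A common pattern on older Hive versions (a sketch; bucketmem_stage is a hypothetical unbucketed staging table with the same columns) is to load the raw file into the staging table and then repopulate the bucketed table with an INSERT ... SELECT while bucketing is enforced:
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketmem SELECT id, name, age FROM bucketmem_stage;
After this, TABLESAMPLE can read only the sampled bucket files instead of scanning the whole table.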
 
VI. A Complete Example
Create a table
CREATE TABLE u_data ( 
userid INT, 
movieid INT, 
rating INT, 
unixtime STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE; 
Download the sample data files and unpack them
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz 
tar xvzf ml-data.tar__0.gz 
Load the data into the table
LOAD DATA LOCAL INPATH 'ml-data/u.data' 
OVERWRITE INTO TABLE u_data; 
Count the total number of rows
SELECT COUNT(1) FROM u_data; 
Now let's do some more complex data analysis.
Create a script weekday_mapper.py that derives the weekday for each record:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # derive the weekday from the unix timestamp
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapper script
-- create a table whose rows are split into fields on the delimiter
CREATE TABLE u_data_new ( 
userid INT, 
movieid INT, 
rating INT, 
weekday INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
-- add the python script to Hive's distributed cache
add FILE weekday_mapper.py; 
Map the data to weekdays
INSERT OVERWRITE TABLE u_data_new 
SELECT 
TRANSFORM (userid, movieid, rating, unixtime) 
USING 'python weekday_mapper.py' 
AS (userid, movieid, rating, weekday) 
FROM u_data; 
SELECT weekday, COUNT(1) 
FROM u_data_new 
GROUP BY weekday; 
Processing Apache weblog data
Parse each web log line with a regular expression (RegexSerDe) and load the extracted fields into a table.
add jar ../build/contrib/hive_contrib.jar; 
CREATE TABLE apachelog ( 
host STRING, 
identity STRING, 
user STRING, 
time STRING, 
request STRING, 
status STRING, 
size STRING, 
referer STRING, 
agent STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?", 
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" 
STORED AS TEXTFILE;
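Once the table exists, a log file can be loaded and queried like any other text table (a sketch; the path 'access_log' is just illustrative):
LOAD DATA LOCAL INPATH 'access_log' INTO TABLE apachelog;
SELECT host, status, count(1) FROM apachelog GROUP BY host, status;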