Integrating Hive with HBase

Posted by 拥有回忆 on 2019-11-29 09:59:51



# by coco
# 2014-07-25


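 This article accomplishes the following goals:
   1. A table created in Hive is created and stored directly in HBase.
   2. Data inserted into the Hive table is written through to the corresponding HBase table.
   3. Changes to the mapped column values in HBase are also visible in the corresponding Hive table.
   4. Multiple columns and multiple column families can be mapped (example: 3 Hive columns mapped to 2 HBase column families).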
   


 Hive and HBase integration
 1. Create a table that HBase recognizes:
hive>  CREATE TABLE hbase_table_1(key int, value string)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")   
    > TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 1.833 seconds
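hbase.table.name sets the name of the table on the HBase side.
hbase.columns.mapping maps each Hive column, in order, to the HBase row key (:key) or to a column-family:qualifier pair (here cf1:val).
The table as seen from HBase: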
hbase(main):007:0> list
TABLE                                                                                                                        
hivetest                                                                                                                     
student                                                                                                                      
test                                                                                                                         
xyz                                                                                                                          
4 row(s) in 0.1050 seconds


=> ["hivetest", "student", "test", "xyz"]
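To confirm from the Hive side which HBase table and column mapping a Hive table carries, the table metadata can be inspected. A minimal sketch, assuming the hbase_table_1 table created above; the exact layout of the output varies by Hive version, so none is shown here:

```sql
-- Show the full table metadata for the HBase-backed table.
-- The storage handler class, hbase.table.name and hbase.columns.mapping
-- should appear among the table and SerDe parameters.
DESCRIBE FORMATTED hbase_table_1;
```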


2. Loading data with SQL
i. Prepare the data in advance
a) Create a Hive table for the source data
hive> create table ccc(foo int,bar string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
OK
Time taken: 2.563 seconds
[root@db96 ~]# cat kv1.txt 
1       val_1
2       val_2
This file is in root's home directory: /root/kv1.txt
  
[root@db96 ~]# 
hive> load data local inpath '/root/kv1.txt' overwrite into table ccc;
Copying data from file:/root/kv1.txt
Copying file: file:/root/kv1.txt
Loading data to table default.ccc
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://db96:9000/hive/warehousedir/ccc
[Warning] could not update stats.
OK
Time taken: 2.796 seconds
hive> select * from ccc;
OK
1       val_1
2       val_2
NULL    NULL
Time taken: 0.348 seconds, Fetched: 3 row(s)
hive>
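The NULL NULL row in the output above most likely comes from a blank trailing line in kv1.txt (an inference from the cat output, not something the original states). Because the first column becomes the HBase row key, and HBase needs a non-null row key, such rows are best filtered out before writing to the HBase-backed table. A minimal sketch, assuming the ccc and hbase_table_1 tables above:

```sql
-- Copy only rows with a usable key into the HBase-backed table.
-- Rows produced by blank input lines have a NULL foo and are excluded.
INSERT OVERWRITE TABLE hbase_table_1
SELECT foo, bar
FROM ccc
WHERE foo IS NOT NULL;
```

Unlike the where foo=1 filter used in the session, this variant would copy both row 1 and row 2, so the later scan output would contain one more row than shown in this article.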
Load into hbase_table_1 with SQL:
hive> insert overwrite table hbase_table_1 select * from ccc where foo=1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0002, Tracking URL = http://db96:8088/proxy/application_1406161997851_0002/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0002
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-24 16:04:48,938 Stage-0 map = 0%,  reduce = 0%
2014-07-24 16:04:57,571 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.54 sec
MapReduce Total cumulative CPU time: 2 seconds 540 msec
Ended Job = job_1406161997851_0002
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.54 sec   HDFS Read: 217 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 540 msec
OK
Time taken: 27.648 seconds


Check the data. The query shows the row just inserted:
1       val_1
hive> select * from hbase_table_1;
OK
1       val_1
Time taken: 1.143 seconds, Fetched: 1 row(s)


Log in to the HBase shell and look at the loaded data:
hbase(main):008:0> scan "xyz"
ROW                           COLUMN+CELL                                                                          
 1                            column=cf1:val, timestamp=1406189096793, value=val_1                                 
1 row(s) in 0.1090 seconds


hbase(main):009:0> 
As shown, the row inserted through Hive (key 1, value val_1) is now in HBase.
Add rows from the HBase side:
hbase(main):009:0> put 'xyz','100','cf1:val','www.gongchang.com'
hbase(main):011:0> put 'xyz','200','cf1:val','hello,word!'
hbase(main):012:0> scan "xyz"
ROW                           COLUMN+CELL                                                                          
 1                            column=cf1:val, timestamp=1406189096793, value=val_1                                 
 100                          column=cf1:val, timestamp=1406189669476, value=www.gongchang.com                     
 200                          column=cf1:val, timestamp=1406189704742, value=hello,word!                           
3 row(s) in 0.0240 seconds


Back in Hive, check the data:
hive> select * from hbase_table_1;
OK
1       val_1
100     www.gongchang.com
200     hello,word!
Time taken: 1.097 seconds, Fetched: 3 row(s)
hive> 
The rows just put into HBase are now visible in Hive.


Accessing an existing HBase table from Hive
Prepare the data in HBase (the student table already exists):
hbase(main):014:0> describe "student"
DESCRIPTION                                                                ENABLED                                 
 'student', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => true                                    
  'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',                                         
  MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',                                         
  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}                                                
1 row(s) in 0.1380 seconds


hbase(main):015:0> put "student",'1','info:name','tom'
hbase(main):017:0> put "student",'2','info:name','lily'
hbase(main):018:0> put "student",'3','info:name','wwn'
hbase(main):019:0> scan "student"
ROW                           COLUMN+CELL                                                                          
 1                            column=info:name, timestamp=1406189948888, value=tom                                 
 2                            column=info:name, timestamp=1406190005724, value=lily                                
 3                            column=info:name, timestamp=1406190016967, value=wwn                                 
3 row(s) in 0.0420 seconds


In Hive, map the existing HBase table with CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE hbase_table_3(key int, value string)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")   
TBLPROPERTIES("hbase.table.name" = "student"); 
hive> CREATE EXTERNAL TABLE hbase_table_3(key int, value string)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")   
    > TBLPROPERTIES("hbase.table.name" = "student"); 
OK
Time taken: 1.21 seconds
hive> select * from hbase_table_3;
OK
1       tom
2       lily
3       wwn
Time taken: 0.107 seconds, Fetched: 3 row(s)
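As shown above, Hive can now read the data that already existed in HBase.
Note: if values in the mapped column info:name change in HBase, Hive query results change accordingly; updates to
    columns or column families that are not in the mapping do not appear in Hive query results.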
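The mapping above lists only info:name and relies on the first Hive column being taken implicitly as the row key. The Hive HBase integration documentation recommends naming the key explicitly with :key. A sketch of the same external table with an explicit key mapping; the table name hbase_table_3_explicit is hypothetical and not part of the original session:

```sql
-- Same HBase table and column as hbase_table_3, but the row key is
-- mapped explicitly instead of being implied by column order.
-- hbase_table_3_explicit is an illustrative name only.
CREATE EXTERNAL TABLE hbase_table_3_explicit(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name")
TBLPROPERTIES ("hbase.table.name" = "student");
```

With the explicit form, the number of mapping entries matches the number of Hive columns one to one, which makes longer mappings easier to check by eye.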
    
III. Multiple Columns and Families
1. Create the table


CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu")
TBLPROPERTIES("hbase.table.name" = "student_info");   
Run it in the Hive CLI:
hive> CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu")
    > TBLPROPERTIES("hbase.table.name" = "student_info"); 
OK
Time taken: 2.957 seconds
hive> select * from hbase_table_2;                   
OK
Time taken: 1.16 seconds
hive> select * from hbase_table_3;
OK
1       tom
2       lily
3       wwn
4       marry
Time taken: 0.117 seconds, Fetched: 4 row(s)
hive> set hive.cli.print.header=true;                
hive> select * from hbase_table_3;   
OK
hbase_table_3.key       hbase_table_3.value
1       tom
2       lily
3       wwn
4       marry
Time taken: 1.132 seconds, Fetched: 4 row(s)
hive> desc hbase_table_3;
OK
col_name        data_type       comment
key                     int                     from deserializer   
value                   string                  from deserializer   
Time taken: 0.19 seconds, Fetched: 2 row(s)
hive> insert overwrite table hbase_table_add1 select key,value,key+1,value from hbase_table_3;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0003, Tracking URL = http://db96:8088/proxy/application_1406161997851_0003/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0003
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-25 08:42:46,068 Stage-0 map = 0%,  reduce = 0%
2014-07-25 08:42:56,218 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
MapReduce Total cumulative CPU time: 2 seconds 770 msec
Ended Job = job_1406161997851_0003
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.77 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 770 msec
OK
_col0   _col1   _col2   _col3
Time taken: 28.01 seconds
hive> select * from  hbase_table_add1;
OK
hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1       tom     2       NULL
2       lily    3       NULL
3       wwn     4       NULL
4       marry   5       NULL
Time taken: 1.105 seconds, Fetched: 4 row(s)
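The NULL values in value3 above are consistent with Hive's type conversion: the select list supplies value (a string such as tom) for value3, which is declared int, and a non-numeric string converted to int yields NULL. A small sketch that shows the same conversion directly, assuming the hbase_table_3 table above:

```sql
-- value holds names such as 'tom'; casting a non-numeric string to INT
-- produces NULL rather than an error, which is what ended up in value3.
SELECT key,
       value,
       CAST(value AS INT) AS value_as_int
FROM hbase_table_3;
```

The second insert in the session avoids this by supplying key+100, an int expression, for value3.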
hive> insert overwrite table hbase_table_add1 select key,value,key+1,key+100 from hbase_table_3;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0004, Tracking URL = http://db96:8088/proxy/application_1406161997851_0004/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-25 08:45:15,164 Stage-0 map = 0%,  reduce = 0%
2014-07-25 08:45:25,609 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1406161997851_0004
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.69 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
key     value   _c2     _c3
Time taken: 25.587 seconds
hive> select * from hbase_table_add1;
OK
hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1       tom     2       101
2       lily    3       102
3       wwn     4       103
4       marry   5       104
Time taken: 1.122 seconds, Fetched: 4 row(s)


Log in to HBase and check:
hbase(main):001:0> list
TABLE                                                                                                              
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hivetest
student                                                                                                            
student_info                                                                                                       
test                                                                                                               
xyz                                                                                                                
5 row(s) in 2.4090 seconds


=> ["hivetest", "student", "student_info", "test", "xyz"]
hbase(main):002:0> scan "student_info"
ROW                           COLUMN+CELL                                                                          
 1                            column=city:nu, timestamp=1406249125147, value=101                                   
 1                            column=info:col1, timestamp=1406249125147, value=tom                                 
 1                            column=info:col2, timestamp=1406249125147, value=2                                   
 2                            column=city:nu, timestamp=1406249125147, value=102                                   
 2                            column=info:col1, timestamp=1406249125147, value=lily                                
 2                            column=info:col2, timestamp=1406249125147, value=3                                   
 3                            column=city:nu, timestamp=1406249125147, value=103                                   
 3                            column=info:col1, timestamp=1406249125147, value=wwn                                 
 3                            column=info:col2, timestamp=1406249125147, value=4                                   
 4                            column=city:nu, timestamp=1406249125147, value=104                                   
 4                            column=info:col1, timestamp=1406249125147, value=marry                               
 4                            column=info:col2, timestamp=1406249125147, value=5                                   
4 row(s) in 0.1110 seconds


hbase(main):003:0> 


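There are 3 Hive value columns here (value1, value2, value3) and 2 HBase column families (info, city).
Two Hive columns (value1 and value2) map to one HBase column family (info, with HBase column qualifiers col1 and col2);
the remaining Hive column (value3) maps to the qualifier nu in the city column family.
This shows how multiple columns of a Hive table can be stored in a small, fixed set of HBase column families.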

Reposted from: https://my.oschina.net/u/580135/blog/612188
