Hive Basics, Part 5

Submitted by Anonymous (unverified) on 2019-12-03 00:13:02

Comparison of Mainstream Hive File Storage Formats

1. Compression Ratio Test of Storage Files

1.1 Test data
https://github.com/liufengji/Compression_Format_Data (log.txt, 18.1 M)
1.2 TextFile
  • Create a table stored as TextFile

create table log_text (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
  • Load data into the table

load data local inpath '/home/hadoop/log.txt' into table log_text;
  • Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+
1.3 Parquet
  • Create a table stored as Parquet

create table log_parquet (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
  • Load data into the table

insert into table log_parquet select * from log_text;
  • Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+
1.4 ORC
  • Create a table stored as ORC

create table log_orc (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
  • Load data into the table

insert into table log_orc select * from log_text;
  • Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
1.5 Compression ratio summary
ORC > Parquet > TextFile (best compression first: 2.8 M, 13.1 M, 18.1 M)
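
To make the ordering concrete, the ratios can be worked out from the sizes reported in sections 1.2 to 1.4. The following is a minimal Python sketch, not part of the original test: the names SIZES_MB and compression_ratios are illustrative, and the inputs are the rounded values printed by dfs -du -h, so the results are approximate.

```python
# Sizes as printed by `dfs -du -h` in sections 1.2 to 1.4 (values in M, rounded by the tool).
SIZES_MB = {"TextFile": 18.1, "Parquet": 13.1, "ORC": 2.8}


def compression_ratios(sizes, baseline="TextFile"):
    """Return baseline size divided by each format's size (higher means smaller files)."""
    base = sizes[baseline]
    return {name: base / size for name, size in sizes.items()}


ratios = compression_ratios(SIZES_MB)

# Print from the best ratio to the worst.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    saved = (1 - 1 / ratio) * 100  # percentage of the TextFile size that is saved
    print(f"{name}: {ratio:.2f}x, {saved:.1f}% smaller than TextFile")
```

On these numbers ORC is roughly 6.5 times smaller than the TextFile table and Parquet roughly 1.4 times smaller. Note that the ORC figure includes its default ZLIB compression (see section 3.3), so this compares each format as created with its defaults rather than the raw layouts alone.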

2. Query Speed Test of Storage Files

2.1 TextFile
select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)
2.2 Parquet
select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)
2.3 ORC
select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)
2.4 Query speed summary
ORC > TextFile > Parquet (fastest first: 15.943 s, 16.99 s, 17.994 s)
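
The size of the gaps between these timings is easier to judge as percentages. The sketch below is illustrative only: TIMES_S and relative_to_fastest are made-up names, and the inputs are the single-run times reported by the queries in sections 2.1 to 2.3.

```python
# Elapsed times in seconds reported for `select count(*)` in sections 2.1 to 2.3.
TIMES_S = {"TextFile": 16.99, "Parquet": 17.994, "ORC": 15.943}


def relative_to_fastest(times):
    """Return each format's slowdown versus the fastest one, as a percentage."""
    fastest = min(times.values())
    return {name: (t - fastest) / fastest * 100 for name, t in times.items()}


slowdown = relative_to_fastest(TIMES_S)

# Print from the fastest to the slowest.
for name in sorted(TIMES_S, key=TIMES_S.get):
    print(f"{name}: {TIMES_S[name]:.3f} s (+{slowdown[name]:.1f}% vs fastest)")
```

The whole spread is about 2 seconds on one run over 100000 rows, so the ordering is best read as indicative; repeating each query several times would show how much of the gap is run-to-run variation.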

3. Combining Storage Format and Compression

3.1 Create an uncompressed ORC table
  • 1. Create an uncompressed ORC table

create table log_orc_none (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="NONE");
  • 2. Load the data

insert into table log_orc_none select * from log_text;
  • 3. Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+
3.2 Create a Snappy-compressed ORC table
  • 1. Create a Snappy-compressed ORC table

create table log_orc_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");
  • 2. Load the data

insert into table log_orc_snappy select * from log_text;
  • 3. Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+
3.3 Create a ZLIB-compressed ORC table
  • When no compression codec is specified, ORC uses ZLIB by default

    • See the log_orc table created above

  • Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+
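3.4 Storage format and compression summary
  • ORC's default ZLIB compression produces smaller files than Snappy (2.8 M vs. 3.8 M).

  • In real projects, Hive table data is usually stored as ORC or Parquet.

  • Because Snappy compresses and decompresses quickly, it is usually chosen as the compression codec.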
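
The space side of that trade-off can be quantified from the three ORC tables measured in sections 3.1 to 3.3. This is a small Python sketch under the same caveats as the figures themselves (rounded dfs -du -h values, one data set); ORC_SIZES_MB and reduction_vs_uncompressed are illustrative names, not anything from Hive.

```python
# ORC table sizes as printed by `dfs -du -h` in sections 3.1 to 3.3 (values in M).
ORC_SIZES_MB = {"NONE": 7.7, "SNAPPY": 3.8, "ZLIB": 2.8}


def reduction_vs_uncompressed(sizes, baseline="NONE"):
    """Return the percentage by which each codec shrinks the uncompressed ORC size."""
    base = sizes[baseline]
    return {codec: (1 - size / base) * 100 for codec, size in sizes.items()}


reduction = reduction_vs_uncompressed(ORC_SIZES_MB)

for codec, pct in reduction.items():
    print(f"{codec}: {ORC_SIZES_MB[codec]} M, {pct:.1f}% smaller than uncompressed ORC")
```

On this data set ZLIB saves about 13 percentage points more space than Snappy. Choosing Snappy, as the summary recommends, gives up that extra space in exchange for faster compression and decompression, which this test did not measure.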
