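This note covers compression settings for Hive on the CDH 6.0.1 platform.
ORC with Snappy compression is a commonly used combination, and CDH 6 ships with Snappy already deployed.
Enabling compression for Hive tables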
set hive.exec.compress.output=true;
CREATE TABLE `virtual_payment_cp` (
`ID` bigint,
`DEVICE_CODE` string COMMENT 'xx',
`LOGIN_ACCOUNT` string COMMENT 'xx',
`AMOUNT` decimal(11,2) COMMENT 'xx',
`PAY_RESULT` int COMMENT 'xx',
`CP_GAME_ID` bigint COMMENT 'xx'
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
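As a rough sketch of how the table above might be loaded and checked: the staging table virtual_payment_cp_staging is a hypothetical name that does not appear in the original, and it is assumed to have the same columns as the ORC table. The two dynamic-partition settings are only needed because the partition value is taken from the data.

```sql
-- Hypothetical source table: virtual_payment_cp_staging (an assumption, not from the original).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE virtual_payment_cp PARTITION (`DATE`)
SELECT `ID`, `DEVICE_CODE`, `LOGIN_ACCOUNT`, `AMOUNT`, `PAY_RESULT`, `CP_GAME_ID`, `DATE`
FROM virtual_payment_cp_staging;

-- orc.compress=SNAPPY should be listed among the table properties.
SHOW TBLPROPERTIES virtual_payment_cp;
DESCRIBE FORMATTED virtual_payment_cp;
```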
Enabling compression for the map stage
CDH -> YARN -> Configuration -> mapred-site.xml -> MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml, add:
<property><name>mapreduce.map.output.compress</name><value>true</value></property>
<property><name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
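Or set it in Hive (the map-output properties are mapreduce.map.output.*; mapred.output.compress is the deprecated name for the job output setting, not the map output):
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;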
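A small sketch of two related session-level checks, offered as a suggestion rather than as part of the original procedure. hive.exec.compress.intermediate asks Hive to compress the files it writes between the jobs of a multi-stage query, and a SET statement with a key but no value prints the value currently in effect, which helps confirm that the cluster or session setting was picked up.

```sql
-- Compress intermediate files passed between the jobs of a multi-stage query.
SET hive.exec.compress.intermediate=true;

-- SET with a key and no value prints the current value.
SET hive.exec.compress.intermediate;
SET mapreduce.map.output.compress;
SET mapreduce.map.output.compress.codec;
```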
Enabling compression for reduce output
CDH
<property><name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value></property>
<property><name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
Hive
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
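One way to see the output settings take effect is to export query results as delimited text, sketched below. The output directory and the partition value are made-up placeholders, and hive.exec.compress.output is included on the assumption that it has not already been set in the session. With the Snappy codec, the written files would be expected to carry a .snappy extension.

```sql
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical HDFS path and partition value, for illustration only.
INSERT OVERWRITE DIRECTORY '/tmp/virtual_payment_cp_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM virtual_payment_cp WHERE `DATE` = '2019-08-01';
```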
Other compression options, for example:
Parquet+Snappy
STORED AS parquet tblproperties("parquet.compression"="SNAPPY");
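For context, a full table definition using that clause might look like the sketch below. The table name virtual_payment_cp_parquet is hypothetical; the columns simply mirror the ORC example earlier in this note.

```sql
-- Hypothetical table name; columns copied from the ORC example.
CREATE TABLE `virtual_payment_cp_parquet` (
  `ID` bigint,
  `DEVICE_CODE` string,
  `LOGIN_ACCOUNT` string,
  `AMOUNT` decimal(11,2),
  `PAY_RESULT` int,
  `CP_GAME_ID` bigint
) PARTITIONED BY (`DATE` STRING)
STORED AS parquet tblproperties ("parquet.compression"="SNAPPY");
```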
Source: https://blog.csdn.net/lingeio/article/details/99676710