How can I insert into a Hive table with Parquet file format and SNAPPY compression?


Question:

Hive 2.1

I have following table definition :

CREATE EXTERNAL TABLE table_snappy (
  a STRING,
  b INT)
PARTITIONED BY (c STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

Now I would like to insert data into it:

INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1); 

However, when I look into the data file, all I see is a plain Parquet file without any compression. How can I enable snappy compression in this case?

Goal: to have the Hive table data in Parquet format and SNAPPY compressed.

I have also tried setting multiple properties:

SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET PARQUET_COMPRESSION_CODEC=snappy;

as well as

TBLPROPERTIES ('parquet.compression'='SNAPPY'); 

but none of them help. I tried the same with GZIP compression and it does not seem to work either. I am starting to wonder whether this is possible at all. Any help is appreciated.

Answer 1:

One of the best ways to check whether the data is compressed or not is to use parquet-tools.

create external table testparquet (id int, name string)
  stored as parquet
  location '/user/cloudera/testparquet/'
  tblproperties('parquet.compression'='SNAPPY');

insert into testparquet values(1,'Parquet');

Now when you look at the file, its name may not contain .snappy anywhere:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/testparquet
Found 1 items
-rwxr-xr-x   1 anonymous supergroup        323 2018-03-02 01:07 /user/cloudera/testparquet/000000_0

Let's inspect it further...

[cloudera@quickstart ~]$ hdfs dfs -get /user/cloudera/testparquet/*
[cloudera@quickstart ~]$ parquet-tools meta 000000_0
creator:     parquet-mr version 1.5.0-cdh5.12.0 (build ${buildNumber})

file schema: hive_schema
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 R:0 D:1
name:        OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:1 TS:99
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
name:         BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
[cloudera@quickstart ~]$

The SNAPPY tag on each column chunk in the row group section shows that the data is indeed snappy compressed.
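Applied back to the partitioned table from the question, the same approach would look roughly like the sketch below. Note that the working example above uses the property key parquet.compression, while the original CREATE TABLE used parquet.compress; the table name, columns, and partition value are taken from the question, and this is a sketch rather than a verified fix.

-- Sketch, assuming the table can be recreated with the schema from the question:
CREATE EXTERNAL TABLE table_snappy (a STRING, b INT)
PARTITIONED BY (c STRING)
STORED AS PARQUET
LOCATION '/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Or, to keep the existing table, change only the table property
-- (this affects data written after the change, not existing files):
ALTER TABLE table_snappy SET TBLPROPERTIES ('parquet.compression'='SNAPPY');

INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);

After the insert, the new partition's file can be checked with parquet-tools meta in the same way as above.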


