Question
I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting, for example,
spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and the filename ending) is influenced by these settings; I get a file named, for example,
part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet
However, if I write to a partitioned Hive table, this setting has no effect: the file size is always the same. In addition, the filename is always
part-00000
How can I change (or at least verify) the compression codec of the Parquet files in the partitioned case?
My table is:
CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
PARTITIONED BY (`year` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
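For reference, the write itself is roughly the following (a minimal sketch; df is assumed to be a DataFrame whose columns match the table definition):
spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
// insertInto resolves columns by position, so the partition column year
// must be the last column of df
df.write.insertInto("test")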
Answer 1:
Since you are creating an external table, I would proceed like this:
First, write your Parquet dataset with the required compression:
df.write
.partitionBy("year")
.option("compression","<gzip|snappy|none>")
.parquet("<parquet_file_path>")
You can check the codec, as before, via the file extension. Then you can create your external table as follows:
CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
PARTITIONED BY (`year` int)
STORED AS PARQUET
LOCATION '<parquet_file_path>';
If the external table already exists in Hive, you just need to run the following to refresh your table:
MSCK REPAIR TABLE test;
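To verify the codec of the files under a partition directory, one option (a sketch, assuming the parquet-hadoop classes bundled with Spark; the path is a placeholder for one of the written part files) is to read a file's footer:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read the footer of a single part file and print the codec of its first
// column chunk, e.g. GZIP, SNAPPY or UNCOMPRESSED
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("<parquet_file_path>/year=2018/part-00000"))
println(footer.getBlocks.get(0).getColumns.get(0).getCodec)
Alternatively, the parquet-tools command-line utility can dump the same footer metadata.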
Source: https://stackoverflow.com/questions/54023847/spark-compression-when-writing-to-external-hive-table