I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they do not affect the output.
Below is the code snippet I tried.
You are making two different errors here. I don't blame you; I've been there...
Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object:
• either in the hive-site.xml available to Spark at launch time
• or in your code, by re-creating the SparkContext...
sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
val hiveContextAlt = new org.apache.spark.sql.hive.HiveContext(scAlt)
[Edit] with Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.
There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
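For illustration, here is a minimal sketch of that approach for Parquet on Spark 1.6. It assumes you are in spark-shell (so sc already exists), that the Hive classes are on the classpath, and it uses spark.sql.parquet.compression.codec as the Spark-specific property; the application name and the value names scParquet / hiveContextParquet are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumption: running inside spark-shell 1.6, where `sc` is predefined.
// The old context must be stopped before a new one can be created.
sc.stop()

// Set the Spark-specific Parquet property before the context exists.
val conf = new SparkConf()
  .setAppName("parquet-codec-sketch") // hypothetical application name
  .set("spark.sql.parquet.compression.codec", "snappy")

val scParquet = new SparkContext(conf)
val hiveContextParquet = new HiveContext(scParquet)

// Properties prefixed with "spark.sql" are copied from the SparkConf
// into the SQL configuration when the context is created.
println(hiveContextParquet.getConf("spark.sql.parquet.compression.codec"))
```

If the printed value is not the codec you set, the property was most likely applied after the context had already been created.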
For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...
You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress.
Note that the default compression codec has changed with Spark 2; before that it was zlib.
So the only thing you can set is the compression codec, using
dataframe.write.format("orc").option("compression","snappy").save("wtf")
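To make that one-liner easier to try out, here is a hedged, self-contained sketch for Spark 2.x that writes the same small DataFrame once per codec. The local master, the application name, the sparkSketch / df names, and the /tmp output paths are all assumptions for the example; on Spark versions before 2.3 it also assumes the spark-hive module is on the classpath, since the ORC data source lived there.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local Spark 2.x session; adjust master/appName for your cluster.
val sparkSketch = SparkSession.builder()
  .master("local[*]")
  .appName("orc-compression-sketch") // hypothetical application name
  .getOrCreate()

// A tiny DataFrame, just so there is something to write.
val df = sparkSketch.range(0, 1000).toDF("id")

// Write one copy per codec, using the DataFrameWriter "compression" option.
for (codec <- Seq("none", "snappy", "zlib")) {
  df.write
    .mode("overwrite")
    .format("orc")
    .option("compression", codec)
    .save(s"/tmp/orc_sketch_$codec") // hypothetical output path
}
```

Comparing the sizes of the resulting directories is a quick way to check whether the option was actually picked up.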