Unable to configure ORC properties in Spark


I am using Spark 1.6 (Cloudera 5.8.2) and have tried the methods below to configure ORC properties, but they do not affect the output.

Below is the code snippet I tried.

1 Answer

    You are making two different errors here. I don't blame you; I've been there...

    Issue #1
    orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object...

    • either in the hive-site.xml available to Spark at launch time (see the sketch after the Spark 2.x snippet below)
    • or in your code, by re-creating the SparkContext...

     sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
     sc.stop
     val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
     scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
     val hiveContextAlt = new org.apache.spark.sql.hive.HiveContext(scAlt)

    [Edit] with Spark 2.x the script would become...
     spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
     spark.close
     val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
     sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
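
    For the hive-site.xml route (the first bullet above), the entry uses the standard Hadoop configuration syntax; a minimal sketch, assuming the same orc.compress key is the one your Hive setup honors:

     <property>
       <name>orc.compress</name>
       <value>SNAPPY</value>
     </property>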

    Issue #2
    Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.

    There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
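
    For example, a minimal sketch (spark.sql.parquet.compression.codec is one of those documented Parquet properties; otherwise this follows the Spark 1.x snippet above):

     // sketch: put the documented Parquet codec property on the SparkConf
     // before the contexts are built, so the hiveContext picks it up
     val conf = (new org.apache.spark.SparkConf)
       .set("spark.sql.parquet.compression.codec", "snappy")
     val sc = new org.apache.spark.SparkContext(conf)
     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)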

    For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...

    You can set the following ORC-specific option(s) for writing ORC files:
    compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress.

    Note that the default compression codec changed with Spark 2; before that, it was zlib.

    So the only thing you can set is the compression codec, using

    dataframe.write.format("orc").option("compression","snappy").save("wtf")
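
    A fuller sketch putting it together (the table name and paths here are placeholders, not from the question):

     // write a DataFrame as Snappy-compressed ORC, then read it back
     val df = hiveContext.table("some_table")   // hypothetical source
     df.write.format("orc").option("compression", "snappy").save("/tmp/orc_out")
     val check = hiveContext.read.format("orc").load("/tmp/orc_out")
     check.printSchema()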
    