How to set ORC stripe size in Spark

Posted by 旧巷老猫 on 2019-12-08 03:45:15

Question


I am trying to generate a dataset in Spark 2.3 and write it in the ORC file format. I want to set the ORC stripe size and compress size, and I took hints from this SO post. But Spark is not honoring those properties, and the stripes in the resulting ORC files are much smaller than the size I set.

val conf: SparkConf = new SparkConf().setAppName("App")
  .set("spark.sql.orc.impl", "native")
  .set("spark.sql.hive.convertMetastoreOrc", "true")
  .set("spark.sql.orc.stripe.size", "67108864")
  .set("spark.sql.orc.compress.size", "262144")
  .set("orc.stripe.size", "67108864")
  .set("orc.compress.size", "262144")

data.sortWithinPartitions("column")
  .write
  .option("orc.compress", "ZLIB")
  .mode("overwrite")
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(outputPath)

I also tried passing the same properties as write options:

data.sortWithinPartitions("column")
  .write
  .option("orc.compress", "ZLIB")
  .option("orc.stripe.size", "67108864")
  .option("orc.compress.size", "262144")
  .mode("overwrite")
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(outputPath)

But no luck.

Relevant sections from ORC file dump:

File Version: 0.12 with ORC_135
Rows: 3174228
Compression: ZLIB
Compression size: 32768
...
Stripe: offset: 3 data: 6601333 rows: 30720 tail: 2296 index: 16641
Stripe: offset: 6620273 data: 6016778 rows: 25600 tail: 2279 index: 13595
Stripe: offset: 12652925 data: 6031290 rows: 25600 tail: 2284 index: 13891
Stripe: offset: 18700390 data: 6132228 rows: 25600 tail: 2283 index: 13805
Stripe: offset: 24848706 data: 6066176 rows: 25600 tail: 2267 index: 13855
Stripe: offset: 30931004 data: 6562819 rows: 30720 tail: 2308 index: 16851
Stripe: offset: 37512982 data: 6462380 rows: 30720 tail: 2304 index: 16994
Stripe: offset: 43994660 data: 6655346 rows: 30720 tail: 2291 index: 17031
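To put a number on the mismatch, here is a minimal Python sketch that parses the stripe lines from the dump above and compares each stripe's on-disk length with the requested 67108864 bytes. The names `parse_stripes` and `REQUESTED_STRIPE_SIZE` are made up for illustration, and the length is assumed to be index + data + tail, which matches the gaps between consecutive offsets in this dump. On-disk length after ZLIB compression is only a rough proxy for the configured stripe size, so read the percentages as an indication rather than an exact measure.

```python
import re

# Stripe lines copied verbatim from the ORC file dump above.
DUMP = """\
Stripe: offset: 3 data: 6601333 rows: 30720 tail: 2296 index: 16641
Stripe: offset: 6620273 data: 6016778 rows: 25600 tail: 2279 index: 13595
Stripe: offset: 12652925 data: 6031290 rows: 25600 tail: 2284 index: 13891
Stripe: offset: 18700390 data: 6132228 rows: 25600 tail: 2283 index: 13805
Stripe: offset: 24848706 data: 6066176 rows: 25600 tail: 2267 index: 13855
Stripe: offset: 30931004 data: 6562819 rows: 30720 tail: 2308 index: 16851
Stripe: offset: 37512982 data: 6462380 rows: 30720 tail: 2304 index: 16994
Stripe: offset: 43994660 data: 6655346 rows: 30720 tail: 2291 index: 17031
"""

# Value passed as orc.stripe.size in the question (64 MiB).
REQUESTED_STRIPE_SIZE = 67108864

STRIPE_RE = re.compile(
    r"Stripe: offset: (\d+) data: (\d+) rows: (\d+) tail: (\d+) index: (\d+)"
)


def parse_stripes(dump):
    """Return one dict per stripe line with its offset, row count and length."""
    stripes = []
    for match in STRIPE_RE.finditer(dump):
        offset, data, rows, tail, index = map(int, match.groups())
        # Assumption: on-disk stripe length = index + data + tail sections.
        stripes.append({"offset": offset, "rows": rows,
                        "length": index + data + tail})
    return stripes


stripes = parse_stripes(DUMP)
for s in stripes:
    share = 100.0 * s["length"] / REQUESTED_STRIPE_SIZE
    print(f"offset={s['offset']:>9} rows={s['rows']:>6} "
          f"length={s['length']:>8} ({share:.1f}% of requested)")
```

Each stripe comes out at roughly 6 to 7 MB on disk, about a tenth of the requested size, and the dump reports a compression size of 32768 rather than the requested 262144.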

Answer 1:


The following works on Spark 2.4.4.

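from pyspark.sql import SparkSession

# Set the Hive ORC default stripe size to 64 MiB (in bytes) when building the session.
spark = (SparkSession
    .builder
    .config('hive.exec.orc.default.stripe.size', 64*1024*1024)
    .getOrCreate()
    )
df = ...
df.write.format('orc').save('output.orc')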



Answer 2:


I've had the same issue, and in my case it appears to be related to the version of Hortonworks HDP in use. The post below has a similar discussion, where they suggest using HDP 2.6.3+ with Spark 2.2+, which uses the newer Hive libraries:

https://community.hortonworks.com/questions/159893/spark-orc-stripe-size.html

Perhaps your Spark 2.3 is still configured to use the older Hive 1.2.1 library.



Source: https://stackoverflow.com/questions/52075481/how-to-set-orc-stripe-size-in-spark
