Parquet vs ORC vs ORC with Snappy

Submitted by anonymous (unverified) on 2019-12-03 02:44:02

Question:

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with the default compression and once with Snappy.

I have read many documents that claim Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents say.

Here are some details of my data.

Table A - Text File Format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

Parquet was the worst as far as compression of my table is concerned.
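For reference, a minimal sketch of how four such variants of one dataset can be declared in Hive; the table and column names here are hypothetical, not the ones I actually used:

    -- hypothetical source table in plain text
    CREATE TABLE sales_text (id BIGINT, amount DOUBLE, region STRING, ts TIMESTAMP)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;

    -- ORC with its default codec (ZLIB)
    CREATE TABLE sales_orc
      STORED AS ORC
      AS SELECT * FROM sales_text;

    -- ORC with Snappy
    CREATE TABLE sales_orc_snappy
      STORED AS ORC
      TBLPROPERTIES ("orc.compress" = "SNAPPY")
      AS SELECT * FROM sales_text;

    -- Parquet
    CREATE TABLE sales_parquet
      STORED AS PARQUET
      AS SELECT * FROM sales_text;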

My tests with the above tables yielded the following results.

Row count operation

Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
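For context, the operations above correspond to queries of roughly this shape (using the same hypothetical names as the sketch earlier), run once against each table:

    SELECT COUNT(*) FROM sales_orc;         -- row count
    SELECT SUM(amount) FROM sales_orc;      -- sum of a column
    SELECT AVG(amount) FROM sales_orc;      -- average of a column
    SELECT id, amount, region, ts           -- 4 columns over a range
      FROM sales_orc
     WHERE amount BETWEEN 100 AND 500;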

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet work better with respect to query response time and compression ratio?

Thanks!

Answer 1:

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does (see here).
Apache ORC might be better if your file structure is flattened.

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.
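As a sketch (Hive 0.14 or later; the table and column names are just examples), Bloom filters are enabled per column through table properties when the ORC table is created:

    CREATE TABLE orders_orc (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
      STORED AS ORC
      TBLPROPERTIES (
        "orc.bloom.filter.columns" = "customer_id",  -- columns to build Bloom filters for
        "orc.bloom.filter.fpp"     = "0.05"          -- target false-positive probability
      );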

The Parquet default compression is SNAPPY. Do Tables A, B, C and D hold the same dataset? If yes, it looks like there is something shady about it when it only compresses to 1.9 GB.



Answer 2:

You are seeing this because:

  • Hive has a vectorized ORC reader but no vectorized Parquet reader.

  • Spark has a vectorized Parquet reader but no vectorized ORC reader.

  • Spark performs best with Parquet; Hive performs best with ORC.

I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
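In Hive, vectorized execution is a session setting; a quick sketch of how to check that it is on for your queries (the table name is hypothetical):

    SET hive.vectorized.execution.enabled = true;
    SET hive.vectorized.execution.reduce.enabled = true;
    -- EXPLAIN output then typically reports "Execution mode: vectorized"
    -- for the operators that can run in batches
    EXPLAIN SELECT SUM(amount) FROM my_orc_table;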

(correct as of Hive 2.0 and Spark 2.1)



Answer 3:

The two biggest considerations for ORC over Parquet in Hive are:

Many of the performance improvements provided by the Stinger initiative depend on features of the ORC format, including the block-level index for each column. This leads to potentially more efficient I/O, allowing Hive to skip reading entire blocks of data if it determines that predicate values are not present there. Also, the cost-based optimizer can consider column-level metadata present in ORC files in order to generate the most efficient query plan.

ACID transactions are only possible when using ORC as the file format.
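A minimal sketch of what that looks like (the table name and columns are hypothetical; the session settings are the usual Hive ACID prerequisites):

    -- prerequisites for Hive ACID
    SET hive.support.concurrency = true;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    -- transactional tables must be stored as ORC (and, before Hive 3, bucketed)
    CREATE TABLE events_acid (id BIGINT, payload STRING)
      CLUSTERED BY (id) INTO 4 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ("transactional" = "true");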

A couple of considerations for Parquet over ORC in Spark are: 1) easy creation of DataFrames in Spark, with no need to specify schemas; 2) it works well on highly nested data.

Spark and Parquet are a good combination.
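To illustrate point 1) in Spark SQL terms (the path is hypothetical): because Parquet files are self-describing, Spark infers the schema from the file footer and nothing has to be declared up front.

    CREATE TEMPORARY VIEW events
      USING parquet
      OPTIONS (path "/data/events.parquet");

    SELECT COUNT(*) FROM events;

    -- or query the files directly, again without declaring a schema
    SELECT * FROM parquet.`/data/events.parquet` LIMIT 10;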



Answer 4:

We did some benchmarks comparing the different file formats (Avro, JSON, ORC, and Parquet) in different use cases.

https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

The data is all publicly available and the benchmark code is all open source at:

https://github.com/apache/orc/tree/branch-1.4/java/bench



Answer 5:

Both of them have their advantages. We use Parquet at work together with Hive and Impala, but I just wanted to point out a few advantages of ORC over Parquet: during long-running queries, when Hive queries ORC tables, GC is called about 10 times less frequently. That might be nothing for many projects, but might be crucial for others.

ORC also takes much less time when you need to select just a few columns from the table. Some other queries, especially those with joins, also take less time because of vectorized query execution, which is not available for Parquet.

Also, ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many numeric columns, it doesn't compress as well. This affects both zlib and snappy compression.


