Is it better for Spark to select from Hive or select from file?


tl;dr: I would read it straight from the Parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a 5 million row × 100 column table, some timings I've recorded are:

// Read the Parquet files directly from the filesystem
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
// Read the same data through the Hive metastore table
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s

dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s

dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s

dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
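For reference, here is a minimal sketch of how such timings can be reproduced on the Spark 1.5-era API. The column name "col" and the literal "value" are placeholders, and the wall-clock helper is my own, not part of Spark:

import org.apache.spark.sql.functions._

// Simple wall-clock timer around a Spark action (illustrative only).
def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2fs")
  result
}

// dffile / dfhive as defined above; repeat each call with dfhive to compare.
time("count")     { dffile.count() }
time("sum")       { dffile.agg(sum("col")).first() }
time("substring") { dffile.select(substring(col("col"), 1, 3)).count() }
time("filter")    { dffile.where(col("col") === "value").count() }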

Note that these timings were taken with an older version of Spark and an older version of Hive, so I can't comment on how the speed of the two reading mechanisms may have changed since then.

From what I understand, even though ORC is generally better suited for flat structures and Parquet for nested ones, Spark is optimised towards Parquet. Therefore, it is advisable to use that format with Spark.
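Switching a pipeline's output to Parquet is a one-line change on the writer. A minimal sketch, where df stands for any existing DataFrame and the output paths are placeholders (on Spark 1.5 the .orc writer also requires sqlContext to be a HiveContext):

// Write the same DataFrame in either columnar format for comparison.
df.write.parquet("/path/to/output_parquet") // the format Spark is optimised for
df.write.orc("/path/to/output_orc")         // often preferred for flat schemas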

Furthermore, the metadata for all the tables you read from Parquet will be stored in Hive anyway. From the Spark docs: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
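That manual refresh is a single call (substitute your own table name for db.table):

// Invalidate and reload the cached Parquet metadata after an external update.
sqlContext.refreshTable("db.table")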

I tend to transform data into Parquet format as soon as possible and store it in Alluxio, backed by HDFS. This allows me to achieve better performance for read/write operations and to limit the use of caching.
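As a sketch (the Alluxio master address, port, and path are placeholders for your own deployment), once the Alluxio client is on Spark's classpath the writer just targets an alluxio:// URI:

// Write Parquet into Alluxio; Alluxio persists it to HDFS underneath.
df.write.parquet("alluxio://alluxio-master:19998/data/table_parquet")

// Later reads are served from Alluxio's memory tier rather than HDFS directly.
val dfalluxio = sqlContext.read.parquet("alluxio://alluxio-master:19998/data/table_parquet")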

I hope it helps.
