Is it better for Spark to select from hive or select from file

Submitted by 无人久伴 on 2019-12-12 08:42:03

Question


I was just wondering what people's thoughts are on reading from Hive versus reading directly from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table stored in the same file format, would you rather read from the Hive table or from the underlying file itself, and why?

Mike


Answer 1:


tl;dr: I would read straight from the Parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a table of 5 million rows by 100 columns, some timings I've recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s

dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s

dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s

dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
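
The answer does not say how these timings were taken. As a minimal sketch, wall-clock numbers like the ones above could be collected with a small helper such as the following. The `Timing` object and its `timed` method are hypothetical names introduced here, not part of Spark or of the original answer. Because Spark transformations are lazy, the block being timed should end in an action such as `count()`.

```scala
// Hypothetical helper (not from the original answer): runs a block once and
// returns its result together with the elapsed wall-clock time in seconds.
object Timing {
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block // the by-name block is evaluated exactly once, here
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }
}

// Usage, assuming dffile and dfhive are defined as in the answer:
//   val (rows, secs) = Timing.timed(dffile.count())
//   println(f"dffile count --> $secs%.2fs")
```

A single run mostly measures cold-start cost, so repeating each measurement a few times would likely give a fairer comparison.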

Note that these timings were taken with older versions of Hive and Spark, so I can't comment on how the gap between the two reading mechanisms may have changed in later releases.




Answer 2:


From what I understand, even though ORC is in general better suited to flat structures and Parquet to nested ones, Spark is optimised for Parquet. It is therefore advisable to use that format with Spark.

Furthermore, metadata for all the tables you read from Parquet will be stored in Hive anyway. From the Spark documentation: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
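
The manual refresh mentioned in that quote can be sketched as follows. This assumes Spark 2.0 or later, where `SparkSession` exposes `catalog.refreshTable`; the `RefreshMetadata` wrapper object is a hypothetical name used only for illustration, and the table name is whatever table was changed outside Spark.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper, for illustration only.
object RefreshMetadata {
  // Invalidates Spark's cached metadata for one table, so the next query
  // picks up files and schema changes written by Hive or another tool.
  def refresh(spark: SparkSession, table: String): Unit = {
    spark.catalog.refreshTable(table)
    // The SQL form does the same thing:
    //   spark.sql(s"REFRESH TABLE $table")
  }
}

// Usage, with the table name from the first answer:
//   RefreshMetadata.refresh(spark, "db.table")
```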

I tend to transform data into Parquet format as soon as possible and store it in Alluxio backed by HDFS. This gives me better read/write performance and lets me limit my use of the cache.
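
As a rough sketch of the conversion step, assuming Spark 2.0 or later and CSV input with a header row: the `ToParquet` object, its method name, and any paths passed to it are hypothetical and not taken from the answer. Writing to Alluxio would presumably mean passing an `alluxio://` URI as the output path with the Alluxio client on the classpath, which is not shown here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, for illustration only.
object ToParquet {
  // Reads a CSV file (header row assumed) and rewrites it as Parquet.
  def csvToParquet(spark: SparkSession, in: String, out: String): Unit = {
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // costs an extra pass over the data
      .csv(in)
      .write
      .mode("overwrite")             // replaces any existing output at `out`
      .parquet(out)
  }
}

// Usage with placeholder paths:
//   ToParquet.csvToParquet(spark, "/path/to/input.csv", "/path/to/parquets")
```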

I hope it helps.



Source: https://stackoverflow.com/questions/44120162/is-it-better-for-spark-to-select-from-hive-or-select-from-file
