I was just wondering what people's thoughts are on reading from Hive versus reading directly from a file such as .csv, .txt, .ORC, or .parquet. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why?
Mike
tl;dr: I would read it straight from the Parquet files.
I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row × 100-column table, some timings I've recorded are:
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")
dffile count --> 0.38s; dfhive count --> 8.99s
dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s
dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s
dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
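For reference, here is a minimal sketch of how timings like these could be reproduced on Spark 1.x. The path `/path/to/parquets` and table name `db.table` are placeholders from the examples above; the `time` helper is my own addition, not part of the Spark API:

```scala
// Sketch for Spark 1.x: compare reading raw Parquet files vs. the Hive table.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-vs-hive"))
val sqlContext = new HiveContext(sc)

// Simple wall-clock timer for a single action.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2fs")
  result
}

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

time("dffile count")(dffile.count())
time("dfhive count")(dfhive.count())
```

Note that a single run measures a cold cache on one side and may include JVM warm-up, so repeated runs are needed before drawing conclusions.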
Note that these timings were taken with older versions of Hive and Spark, so I can't comment on how either reading mechanism may have improved since then.
From what I understand, even though in general ORC is better suited for flat structures and Parquet for nested ones, Spark is optimised for Parquet. Therefore, it is advised to use that format with Spark.
Furthermore, metadata for all the tables you read from Parquet will be stored in Hive anyway. From the Spark documentation: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
I tend to transform data into Parquet format as soon as possible and store it in Alluxio backed by HDFS. This allows me to achieve better performance for read/write operations and limits my use of cache().
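As a sketch of that workflow: writing through Alluxio looks the same as writing to HDFS, just with the `alluxio://` URI scheme. The master host, port, and paths below are assumptions, and the CSV read requires the spark-csv package on Spark 1.x:

```scala
// Convert source data to Parquet and store it on Alluxio (backed by HDFS).
// "alluxio://alluxio-master:19998/..." is a placeholder for your own
// Alluxio master address; 19998 is Alluxio's default RPC port.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")   // spark-csv package, Spark 1.x
  .option("header", "true")
  .load("hdfs:///data/raw/input.csv")

raw.write.parquet("alluxio://alluxio-master:19998/data/parquet/table")

// Subsequent reads go through Alluxio's storage tier, so hot data can be
// served from memory without calling cache() on the DataFrame itself.
val df = sqlContext.read.parquet("alluxio://alluxio-master:19998/data/parquet/table")
```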
I hope it helps.
Source: https://stackoverflow.com/questions/44120162/is-it-better-for-spark-to-select-from-hive-or-select-from-file