After reading "What is Hive, is it a database?", a colleague mentioned yesterday that he was able to filter a 15B table and join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder whether this would be slower in Spark. Now that Spark has DataFrames the two may be comparable, but I am not sure, hence the question.
Is Hive faster than Spark? Or does this question not make sense? Sorry for my ignorance.
He uses the latest Hive, which seems to be using Tez.
Hive is just a framework that gives SQL functionality to MapReduce-type workloads.
These workloads can run on MapReduce or YARN.
So the real comparison is Hive on Tez vs. Hive on Spark. A nice article discussing this is "When to go with ETL on Hive using Tez vs. when to go with Spark ETL?" (the gist: use Hive on Spark if you are not sure).
[Chart: lower is better]
Spark is convenient, but its SQL performance does not hold up all that well at scale.
Hive has amazing support for co-partitioned joins. When the tables you are joining have hundreds of millions to billions of rows, you will really appreciate the fine-grained join support via:
- similar DISTRIBUTE BY and SORT BY (or CLUSTER BY)
- bucketed joins
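To make the idea of a bucketed (co-partitioned) join concrete, here is a minimal Python sketch. It is not Hive's implementation, just a toy model. It assumes both tables are bucketed on the join key with the same bucket count, so bucket i of one table only ever needs to meet bucket i of the other. The names `NUM_BUCKETS`, `bucket_of`, `bucketize`, and `bucketed_join` are hypothetical, as is the sample data.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; both tables must use the same bucket count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Deterministic bucket assignment for integer keys (stand-in for a hash).
    return key % num_buckets


def bucketize(rows, key_index, num_buckets=NUM_BUCKETS):
    """Split rows into buckets by the join key, like writing a bucketed table."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of the left table only with bucket i of the right table.

    Assumes the join key is the first column of every row.
    """
    result = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small per-bucket lookup instead of one huge global one.
        index = defaultdict(list)
        for row in right:
            index[row[0]].append(row)
        for lrow in left:
            for rrow in index.get(lrow[0], []):
                result.append(lrow + rrow[1:])
    return result


users = [(1, "ann"), (2, "bob"), (5, "eve")]
orders = [(1, 9.5), (5, 3.0), (5, 7.25), (8, 1.0)]

joined = bucketed_join(bucketize(users, 0), bucketize(orders, 0))
print(sorted(joined))
```

Because matching keys always land in the same bucket number, each bucket pair can be joined independently, which is what lets an engine avoid reshuffling both tables in full.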
Hive has extensive support for metadata-only queries: Spark has only had a glimmer of it since 2.1.
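As a rough illustration of what "metadata-only" means, the Python sketch below answers questions about a partition column from a catalog of partition entries alone, without opening any data files. It is a toy model under stated assumptions, not how the Hive metastore works internally. `Partition`, `CATALOG`, the helper functions, and the paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    # Hypothetical catalog entry: partition value plus the location of its data.
    dt: str
    path: str


# Toy stand-in for a metastore; the data files are never read in this sketch.
CATALOG = [
    Partition("2016-09-01", "/warehouse/events/dt=2016-09-01"),
    Partition("2016-09-02", "/warehouse/events/dt=2016-09-02"),
    Partition("2016-09-03", "/warehouse/events/dt=2016-09-03"),
]


def distinct_partition_values(catalog):
    """In spirit: SELECT DISTINCT dt FROM events, answered from metadata only."""
    return sorted({p.dt for p in catalog})


def max_partition_value(catalog):
    """In spirit: SELECT MAX(dt) FROM events, again without scanning rows."""
    return max(p.dt for p in catalog)


print(distinct_partition_values(CATALOG))
print(max_partition_value(CATALOG))
```

The cost of such a query depends on the number of partitions rather than the number of rows, which is why it matters on tables with billions of records.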
Spark runs out of steam quickly when the number of partitions exceeds maybe 10K+. Hive does not suffer from this limitation.
Fast forward to 2018, Hive is much faster (and more stable) than SparkSQL, especially in concurrent environments, according to the following article:
https://mr3.postech.ac.kr/blog/2018/10/31/performance-evaluation-0.4/
The article compares several SQL-on-Hadoop systems using the TPC-DS benchmark (1TB, 3TB, 10TB) on three clusters (11 nodes, 21 nodes, 42 nodes):
- Hive-LLAP included in HDP (Hortonworks Data Platform) 2.6.4
- Hive-LLAP included in HDP 3.0.1
- Presto 0.203e (with cost-based optimization enabled)
- Presto 0.208e (with cost-based optimization enabled)
- SparkSQL 2.2.0 included in HDP 2.6.4
- SparkSQL 2.3.1 included in HDP 3.0.1
- Hive 3.1.0 running on top of Tez
- Hive on Tez included in HDP 3.0.1
- Hive 3.1.0 running on top of MR3 0.4
- Hive 2.3.3 running on top of MR3 0.4
So, in comparison with Hive-based systems and Presto, SparkSQL is very slow and does not scale in concurrent environments. (Note that the experiment uses SparkSQL running on vanilla Spark.)
Source: https://stackoverflow.com/questions/39416007/is-hive-faster-than-spark