Question
I've installed and set up Spark on YARN and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and managed to create a Hive table, load data, and select from it properly.
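For reference, this is roughly that tutorial flow in pyspark (a minimal sketch; the sample values and the STORED AS clause are assumptions, while test_table and col1 are the names used in the query later in this post):

# Minimal sketch of the tutorial flow; bin/pyspark already provides a `spark`
# session, but building one explicitly keeps the example self-contained.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-tables-check")
         .enableHiveSupport()        # pick up table definitions from the Hive Metastore
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS test_table (col1 INT) STORED AS ORC")
spark.sql("INSERT INTO test_table VALUES (1), (2), (3)")   # hypothetical sample data
spark.sql("SELECT SUM(col1) FROM test_table").show()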
Then I moved on to the next step: setting up Hive on Spark. Using hive / beeline, I was also able to create a Hive table, load data, and select from it properly. Hive is executed on YARN/Spark correctly. How do I know it works? The hive shell displays the following:
hive> select sum(col1) from test_table;
....
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 3 3 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 55.26 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 55.26 second(s)
OK
6
Time taken: 99.165 seconds, Fetched: 1 row(s)
The Resource Manager UI also displays the RUNNING application as Hive on Spark (sessionId = ....), and I am able to visit the ApplicationMaster to look at the query details as well.
The step I cannot achieve yet is integrating pyspark / SparkSQL with Hive on Spark.
What I'm trying:
- Edit $SPARK_HOME/conf/hive-site.xml to set hive.execution.engine=spark:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
Expects one of [mr, tez, spark].
</description>
</property>
- Log in to pyspark using bin/pyspark and check hive.execution.engine:
>>> spark.sql("set spark.master").show()
+------------+-----+
| key|value|
+------------+-----+
|spark.master| yarn|
+------------+-----+
>>> spark.sql("set spark.submit.deployMode").show()
+--------------------+------+
| key| value|
+--------------------+------+
|spark.submit.depl...|client|
+--------------------+------+
>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----------+
| key| value|
+--------------------+-----------+
|hive.execution.en...|<undefined>|
+--------------------+-----------+
- Since there is no value for hive.execution.engine (quite surprising! I did set it in hive-site.xml!), I decided to set it manually as follows:
>>> spark.sql("set hive.execution.engine=spark")
>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----+
| key|value|
+--------------------+-----+
|hive.execution.en...|spark|
+--------------------+-----+
- Select the data from Hive using SparkSQL:
>>> spark.sql("select sum(col1) from test_table").show()
+---------+
|sum(col1)|
+---------+
| 6|
+---------+
- Even though the result is shown, there is no application displayed at the Resource Manager. I understand that SparkSQL does not use Hive on Spark. I have no clue about this.
The questions are:
- How can I make pyspark / SparkSQL use Hive on Spark?
- Is it suitable to do this to speed things up and move away from the mr execution engine?
- Am I mixing and matching the wrong ingredients, or is it simply not possible?
Answer 1:
"Hive on Spark" is short for "HiveServer2 uses the Spark execution engine by default".
- What are the clients of the HS2 service? Apps that treat Hive as a regular database, connecting via JDBC (Java/Scala apps such as beeline), ODBC (R scripts, Windows apps) or DBI (Python apps & scripts), and submitting SQL queries -- see the client sketch below.
- Does that apply to Spark jobs? No...! Spark wants raw access to the data files. In essence, Spark is its own database engine; there is even the Spark ThriftServer that can be used as a (crude) replacement for HS2.
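To illustrate the kind of HS2 client meant above, a minimal DBI-style sketch in Python, assuming the PyHive package is installed and HiveServer2 listens on the default port 10000 (the host name is hypothetical):

# Hypothetical HS2 client: the SQL below is executed by HiveServer2 (and whatever
# execution engine HS2 is configured with), not by this Python process or by Spark.
from pyhive import hive

conn = hive.Connection(host="hs2-host.example.com", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT SUM(col1) FROM test_table")
print(cursor.fetchall())
cursor.close()
conn.close()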
When Spark is built to interact with Hive V1 or Hive V2, it only interacts with the MetaStore service -- i.e. the metadata catalog that makes it possible for multiple systems (HiveServer2 / Presto / Impala / Spark jobs / Spark ThriftServer / etc) to share the same definition for "databases" and "tables", including the location of the data files (i.e. the HDFS directories / S3 pseudo-directories / etc)
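To make that Metastore-only integration concrete, a minimal pyspark sketch (the Metastore URI is a hypothetical value; in practice it usually comes from the hive-site.xml on Spark's classpath):

# Spark only asks the Metastore where test_table lives and what its schema is;
# the table's files are then read by Spark's own execution engine.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-only")
         .config("hive.metastore.uris", "thrift://metastore-host.example.com:9083")  # hypothetical URI
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT SUM(col1) FROM test_table").show()   # runs as a Spark job, never touches HS2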
But each system has its own libraries to read and write into the "tables" -- HiveServer2 uses YARN jobs (with a choice of execution engines such as MapReduce, TEZ, Spark); Impala and Presto have their own execution engines running outside of YARN; Spark has its own execution engine running inside or outside of YARN.
And unfortunately these systems do not coordinate their read/write operations, which can be a real mess (i.e. a Hive SELECT query may crash because a Spark job has just deleted a file while rebuilding a partition, and vice-versa), although the Metastore provides an API to manage read/write locks in ZooKeeper. Only HS2 supports that API, apparently, and it's not even active by default.
PS: Hive LLAP is yet another system, that uses YARN with TEZ (no other option) but with an additional layer of persistence and a memory grid for caching -- i.e. not your regular HiveServer2, but an evolution that HortonWorks introduced as a competitor to Impala and Presto.
When Spark is built to interact with Hive V3 "HortonWorks-style", there is a catch:
- by default HiveServer2 manages "ACID tables" with a specific data format (an ORC variant) that Spark does not support
- by default the Metastore prevents Spark from being aware of any HiveServer2 table, by using different namespaces for HS2 and for Spark -- effectively negating the purpose of having a single, shared catalog...!!
- hence Horton provides a dedicated "connector" for Spark to access Hive tables via HS2 -- which negates the purpose of using the Spark execution engine...!! (see the sketch below)
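For completeness, a rough pyspark sketch of that connector (the Hive Warehouse Connector shipped with HDP 3); the pyspark_llap package name, the HiveWarehouseSession builder and the configuration key below are assumptions based on HortonWorks' HWC documentation, not something verified here:

# Hypothetical Hive Warehouse Connector usage (HDP 3): queries are pushed to
# HiveServer2 over JDBC instead of Spark reading the table files directly.
# Assumes pyspark was started with the HWC assembly jar and Python zip available.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession   # package name assumed per HWC docs

spark = (SparkSession.builder
         .appName("hwc-sketch")
         .config("spark.sql.hive.hiveserver2.jdbc.url",
                 "jdbc:hive2://hs2-host.example.com:10000/default")   # hypothetical HS2 URL
         .getOrCreate())

hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery("SELECT SUM(col1) FROM test_table").show()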
Since Horton has been absorbed by Cloudera, the future of Spark integration with the Metastore is not clear. Most of the good parts from Horton distro are replacing the lame (or missing) parts from Cloudera; but that specific development was not obviously good.
Source: https://stackoverflow.com/questions/60359882/how-can-i-make-the-pyspark-and-sparksql-to-execute-the-hive-on-spark