How can I make pyspark and SparkSQL execute Hive on Spark?


Question


I've installed and set up Spark on YARN and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and managed to create a Hive table, load data, and then select from it properly.

Then I moved to the next step: setting up Hive on Spark. Using hive / beeline, I also managed to create a Hive table, load data, and then select from it properly. Hive executes on YARN/Spark correctly. How do I know it works? The hive shell displays the following:

hive> select sum(col1) from test_table;
....
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      3          3        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 55.26 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 55.26 second(s)
OK
6
Time taken: 99.165 seconds, Fetched: 1 row(s)

The Resource Manager UI also displays the RUNNING application as Hive on Spark (sessionId = ....), and I am able to visit the ApplicationMaster to inspect the query details as well.

The step I cannot get working yet is making pyspark/SparkSQL use Hive on Spark.

What I've tried:

  1. Edit $SPARK_HOME/conf/hive-site.xml to set hive.execution.engine=spark:
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
        <description>
            Expects one of [mr, tez, spark].
        </description>
    </property>
  2. Log in to pyspark using bin/pyspark and check hive.execution.engine.
>>> spark.sql("set spark.master").show()
+------------+-----+
|         key|value|
+------------+-----+
|spark.master| yarn|
+------------+-----+

>>> spark.sql("set spark.submit.deployMode").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|spark.submit.depl...|client|
+--------------------+------+

>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----------+
|                 key|      value|
+--------------------+-----------+
|hive.execution.en...|<undefined>|
+--------------------+-----------+
  3. Since there is no value for hive.execution.engine (quite surprising, since I did set it in hive-site.xml!), I decide to set it manually as follows:
>>> spark.sql("set hive.execution.engine=spark")
>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|hive.execution.en...|spark|
+--------------------+-----+
  4. Select the data from Hive using SparkSQL.
>>> spark.sql("select sum(col1) from test_table").show()
+---------+
|sum(col1)|
+---------+
|        6|
+---------+
  5. The result is shown, but no application is displayed in the Resource Manager. I understand this means SparkSQL is not using Hive on Spark. I have no clue why.

My questions are:

  1. How can I make pyspark / SparkSQL use Hive on Spark?
  2. Is this a suitable way to speed things up and move away from the mr execution engine?
  3. Am I mixing and matching the wrong ingredients, or is it simply not possible?

Answer 1:


"Hive on Spark" is short for "HiveServer2 uses the Spark execution engine by default".

  • What are the clients of the HS2 service? Apps that treat Hive as a regular database, connecting via JDBC (Java/Scala apps such as beeline), ODBC (R scripts, Windows apps) or DBI (Python apps & scripts), and submitting SQL queries -- see the sketch after this list
  • Does that apply to Spark jobs? No...! Spark wants raw access to the data files. In essence, Spark is its own database engine; there is even the Spark ThriftServer that can be used as a (crude) replacement for HS2.
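
For completeness: if a Python script really must go through HiveServer2 (and therefore through Hive on Spark), it has to act as a regular HS2 client rather than create a SparkSession. A minimal sketch, assuming the PyHive package is installed and HiveServer2 listens on its default port 10000; the host name below is a placeholder, and test_table is the table from the question:

# Hypothetical sketch: connect to HiveServer2 as a plain client, so HS2
# (and hence Hive on Spark) executes the query -- no SparkSession involved.
from pyhive import hive   # assumes the PyHive package is installed

conn = hive.Connection(host="hs2-host.example.com", port=10000, database="default")
cur = conn.cursor()
cur.execute("select sum(col1) from test_table")   # shows up as a Hive on Spark job
print(cur.fetchall())
cur.close()
conn.close()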


When Spark is built to interact with Hive V1 or Hive V2, it only interacts with the MetaStore service -- i.e. the metadata catalog that makes it possible for multiple systems (HiveServer2 / Presto / Impala / Spark jobs / Spark ThriftServer / etc) to share the same definition for "databases" and "tables", including the location of the data files (i.e. the HDFS directories / S3 pseudo-directories / etc)
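
In pyspark terms, that means a SparkSession only needs the Metastore to resolve the table definition and then runs the query with Spark's own engine. A minimal sketch of that path, assuming hive-site.xml (with the Metastore location) is available under $SPARK_HOME/conf and reusing test_table from the question:

from pyspark.sql import SparkSession

# enableHiveSupport() attaches the session to the Hive Metastore only;
# the query below is planned and executed by Spark itself, never by HiveServer2.
spark = (SparkSession.builder
         .appName("sparksql-reads-hive-tables")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select sum(col1) from test_table").show()

This is also why step 4 of the question returns the correct result while no separate "Hive on Spark (sessionId = ...)" application appears in the Resource Manager: the work happens inside the already-running pyspark application.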

But each system has its own libraries to read and write into the "tables" -- HiveServer2 uses YARN jobs (with a choice of execution engines such as MapReduce, TEZ, Spark); Impala and Presto have their own execution engines running outside of YARN; Spark has its own execution engine running inside or outside of YARN.

And unfortunately these systems do not coordinate their read/write operations, which can be a real mess (e.g. a Hive SELECT query may crash because a Spark job has just deleted a file while rebuilding a partition, and vice versa), although the Metastore provides an API to manage read/write locks in ZooKeeper. Only HS2 supports that API, apparently, and it's not even active by default.

PS: Hive LLAP is yet another system that uses YARN with TEZ (no other option), but with an additional layer of persistence and a memory grid for caching -- i.e. not your regular HiveServer2, but an evolution that HortonWorks introduced as a competitor to Impala and Presto.


When Spark is built to interact with Hive V3 "HortonWorks-style", there is a catch:
  • by default HiveServer2 manages "ACID tables" with a specific data format (an ORC variant) that Spark does not support
  • by default the Metastore prevents Spark from being aware of any HiveServer2 table, by using different namespaces for HS2 and for Spark -- effectively negating the purpose of having a single, shared catalog...!!
  • hence Horton provides a dedicated "connector" for Spark to access Hive tables via HS2 -- which negates the purpose of using the Spark execution engine (see the sketch below)...!!
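
For reference, using that connector from pyspark looks roughly like the sketch below. This is an assumption based on the HortonWorks Hive Warehouse Connector documentation, not something from the question; the package name and builder API may differ between HDP/CDP releases:

# Hypothetical sketch of the HortonWorks Hive Warehouse Connector (HWC) in pyspark;
# the query is executed on the Hive side (HS2/LLAP) instead of by Spark itself.
from pyspark_llap import HiveWarehouseSession   # shipped with the HWC package (assumption)

hive = HiveWarehouseSession.session(spark).build()
df = hive.executeQuery("select sum(col1) from test_table")
df.show()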

Since Horton has been absorbed by Cloudera, the future of Spark integration with the Metastore is not clear. Most of the good parts of the Horton distro are replacing the lame (or missing) parts of Cloudera's; but that specific development was not obviously one of the good parts.



Source: https://stackoverflow.com/questions/60359882/how-can-i-make-the-pyspark-and-sparksql-to-execute-the-hive-on-spark
