Is there a good resource for fully understanding the explain plan generated by Hive? I searched the wiki but could not find a complete guide.
I will try to explain a little of what I know.
The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.
To see the execution plan for a query, prefix the query with the keyword EXPLAIN, then run it.
Execution plans can be long and complex.
Fully understanding them requires a deep knowledge of MapReduce.
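If you want to capture a plan from a script instead of the shell, a minimal sketch could look like the following. It assumes you already have a Python DB-API style cursor connected to Hive (for example from a client library such as PyHive, which is my assumption and not something the setup above requires). The helper name fetch_plan is hypothetical.

```python
def fetch_plan(cursor, query):
    """Run EXPLAIN on `query` and return the plan as a single string.

    `cursor` is assumed to be any DB-API style cursor connected to Hive;
    this helper is illustrative and not part of Hive itself.
    """
    # Drop a trailing semicolon: it ends a statement in the shell, but a
    # client library usually expects one bare statement per execute() call.
    statement = "EXPLAIN " + query.strip().rstrip(";")
    cursor.execute(statement)
    # Each result row carries one line of the plan in its first column.
    return "\n".join(row[0] for row in cursor.fetchall())
```

Calling fetch_plan(cursor, "SELECT carrier FROM flights;") would send EXPLAIN SELECT carrier FROM flights to the server and return the plan text line by line.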
Example
EXPLAIN CREATE TABLE flights_by_carrier AS
SELECT carrier, COUNT(flight) AS num
FROM flights
GROUP BY carrier;
This query is a CTAS (CREATE TABLE AS SELECT) statement that creates a new table named flights_by_carrier and populates it with the result of a SELECT query.
The SELECT query groups the rows of the flights table by carrier and returns each carrier and the number of flights for that carrier.
Hive's output of the EXPLAIN statement for the example is shown here:
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| Stage-3 depends on stages: Stage-0 |
| Stage-2 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: flights |
| Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: carrier (type: string), flight (type: smallint) |
| outputColumnNames: carrier, flight |
| Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(flight) |
| keys: carrier (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| name: fly.flights_by_carrier |
| |
| Stage: Stage-0 |
| Move Operator |
| files: |
| hdfs directory: true |
| destination: hdfs://localhost:8020/user/hive/warehouse/fly.db/flights_by_carrier |
| |
| Stage: Stage-3 |
| Create Table Operator: |
| Create Table |
| columns: carrier string, num bigint |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat |
| serde name: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| name: fly.flights_by_carrier |
| |
| Stage: Stage-2 |
| Stats-Aggr Operator |
| |
+----------------------------------------------------+--+
Stage Dependencies
The example query will execute in four stages, Stage-0 to Stage-3.
Each stage could be a MapReduce job, an HDFS action, a metastore action, or some other action performed by the Hive server.
The numbering does not imply an order of execution or dependency.
The dependencies between stages determine the order in which they must execute, and Hive specifies these dependencies explicitly at the start of the EXPLAIN results.
A root stage, like Stage-1 in this example, has no dependencies and is free to run first.
Non-root stages cannot run until the stages upon which they depend have completed.
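To make the ordering rule concrete, here is a small Python sketch that reads the four STAGE DEPENDENCIES lines from the example and derives an order that respects them. The function names are my own, and the parser only handles the two line forms that appear in this plan ("is a root stage" and "depends on stages:"); real plans can contain other forms that it would ignore or misread.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The STAGE DEPENDENCIES lines from the example plan, without table borders.
DEPENDENCY_LINES = """\
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-3 depends on stages: Stage-0
Stage-2 depends on stages: Stage-3
"""

def parse_dependencies(text):
    """Map each stage name to the set of stages it depends on."""
    deps = {}
    for line in text.splitlines():
        line = line.strip(" |")  # tolerate the "|" borders of the raw output
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        dependent = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if root:
            deps[root.group(1)] = set()  # a root stage waits for nothing
        elif dependent:
            parents = set(re.findall(r"Stage-\d+", dependent.group(2)))
            deps[dependent.group(1)] = parents
    return deps

def execution_order(deps):
    """Return the stages so that every stage comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(parse_dependencies(DEPENDENCY_LINES))
print(" -> ".join(order))  # Stage-1 -> Stage-0 -> Stage-3 -> Stage-2
```

Note that the resulting order (1, 0, 3, 2) differs from the numeric order of the stage names, which is exactly why the numbering should not be read as a sequence.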
Stage Plans
The stage plans part of the output shows descriptions of the stages.
For Hive, read them from the top down.
Stage-1 is identified as a MapReduce job.
The query plan shows that this job includes both a map phase (described by the Map Operator Tree) and a reduce phase (described by the Reduce Operator Tree).
In the map phase, the map tasks read the flights table and select the carrier and flight columns.
This data is passed to the reduce phase, in which the reduce tasks group the data by carrier and aggregate it by counting flights.
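The plan shows a Group By Operator on both sides: "mode: hash" in the map tree and "mode: mergepartial" in the reduce tree. As I understand it, that means each map task already builds partial counts per carrier, and the reducers only merge those partial counts. The Python sketch below imitates that idea in memory; the sample rows and function names are made up for illustration, and this is of course not how Hive is implemented internally.

```python
from collections import Counter

def map_phase(rows):
    """One map task: partial count(flight) per carrier (plan: mode: hash)."""
    partial = Counter()
    for carrier, flight in rows:
        # COUNT(flight) ignores NULLs, but the carrier still forms a group.
        partial[carrier] += 0 if flight is None else 1
    return partial

def reduce_phase(partials):
    """Merge the partial counts per carrier (plan: mode: mergepartial)."""
    totals = Counter()
    for partial in partials:
        for carrier, count in partial.items():
            totals[carrier] += count
    return dict(totals)

# Two hypothetical input splits of (carrier, flight) rows, one per map task.
split_a = [("AA", 1), ("DL", 2), ("AA", 3)]
split_b = [("DL", 4), ("UA", 5), ("AA", None)]

result = reduce_phase([map_phase(split_a), map_phase(split_b)])
print(result)
```

In the real job, the Reduce Output Operator also partitions and sorts the partial results by _col0 (the carrier), so that all partial counts for one carrier reach the same reducer. The sketch skips that step because everything runs in one process.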
Following Stage-1 is Stage-0, which is an HDFS action (Move).
In this stage, Hive moves the output of the previous stage to a new subdirectory in the warehouse directory in HDFS.
This is the storage directory for the new table that will be named flights_by_carrier.
Following Stage-0 is Stage-3, which is a metastore action: Create Table.
In this stage, Hive creates a new table named flights_by_carrier in the fly database.
The table has two columns: a STRING column named carrier and a BIGINT column named num.
The final stage, Stage-2, collects statistics.
The details of this final stage are not important, but it gathers information such as the number of rows in the table, the number of files that store the table data in HDFS, and the number of unique values in each column in the table.
These statistics can be used to optimize Hive queries.